Temporal action segmentation for unscripted human videos
Develop robust and accurate temporal action segmentation methods for egocentric human hand-activity videos that produce meaningful, atomic-level manipulation clips, each paired with a language instruction and aligned with the short-horizon training format of robotic Vision-Language-Action (VLA) models.
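The desired output can be made concrete with a small sketch. The snippet below is illustrative only: `ActionSegment`, `to_vla_samples`, and the record fields are hypothetical names, not an interface from the paper; it shows the kind of atomic clip-plus-instruction records a segmenter would need to emit for short-horizon VLA pretraining.

```python
"""Minimal sketch of a segmenter's target output; all names are
illustrative assumptions, not an API defined in the cited paper."""
from dataclasses import dataclass


@dataclass
class ActionSegment:
    """One atomic manipulation clip with its language instruction."""
    start_sec: float   # clip start time within the source video
    end_sec: float     # clip end time; short horizon, e.g. a few seconds
    instruction: str   # language description, e.g. "pick up the mug"


def to_vla_samples(segments: list[ActionSegment], video_path: str) -> list[dict]:
    """Convert segments into (clip, instruction) records of the kind
    short-horizon VLA pretraining pipelines typically consume."""
    return [
        {
            "video": video_path,
            "start": seg.start_sec,
            "end": seg.end_sec,
            "instruction": seg.instruction,
        }
        for seg in segments
    ]


if __name__ == "__main__":
    segs = [
        ActionSegment(0.0, 2.4, "reach for the mug"),
        ActionSegment(2.4, 5.1, "pick up the mug"),
    ]
    print(to_vla_samples(segs, "kitchen_egocentric.mp4"))
```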
References
This problem is closely related to temporal action segmentation from videos, which remains an open problem; no existing method meets the requirements of this setting.
— Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos
(Li et al., arXiv:2510.21571, 24 Oct 2025), Section 1 (Introduction), task alignment paragraph