
Temporal action segmentation for unscripted human videos

Develop robust and accurate temporal action segmentation methods for egocentric videos of human hand activities that produce meaningful, atomic-level manipulation clips aligned with language instructions and with the short-horizon format of robotic Vision-Language-Action (VLA) training data.


Background

The paper aims to convert unscripted, in-the-wild egocentric human videos of hand activities into Vision-Language-Action (VLA) episodes aligned with robotic datasets. A central requirement is task alignment: meaningful segmentation and filtering of atomic-level human action sequences that match the short-horizon structure typical of robotic VLA data.

The authors note that this segmentation need is closely related to temporal action segmentation in videos. They explicitly state that temporal action segmentation remains an open problem and that existing methods do not meet their requirements, motivating their own heuristic based on speed minima of 3D hand trajectories.
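To make the speed-minima idea concrete, here is a minimal sketch of how segmenting a 3D hand trajectory at local speed minima could look. This is an illustration only, not the paper's implementation: the function name segment_at_speed_minima, the moving-average smoothing, and the minimum-gap parameter are assumptions for this sketch.

```python
import numpy as np
from scipy.signal import find_peaks

def segment_at_speed_minima(positions, fps, smooth_win=5, min_gap_s=0.5):
    """Split a 3D hand trajectory into clips at local speed minima.

    positions: (T, 3) array of per-frame 3D hand positions.
    fps: frames per second of the video.
    Returns a list of (start_frame, end_frame) clip boundaries.
    Hypothetical sketch; parameters are not taken from the paper.
    """
    # Finite-difference speed (distance units per second),
    # smoothed with a simple moving average to suppress jitter.
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps
    kernel = np.ones(smooth_win) / smooth_win
    speed = np.convolve(speed, kernel, mode="same")

    # Local minima of speed are peaks of the negated speed signal;
    # enforce a minimum spacing so clips last at least min_gap_s seconds.
    minima, _ = find_peaks(-speed, distance=max(1, int(min_gap_s * fps)))

    # Convert cut points into consecutive (start, end) frame ranges.
    bounds = [0, *minima.tolist(), len(positions) - 1]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
```

In practice such a heuristic would likely be combined with the filtering step the paper mentions, discarding clips that do not correspond to meaningful atomic manipulations.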

References

"This problem is closely related to temporal action segmentation from videos, which remains an open problem and there are no existing methods that meet our needs."

Li et al., "Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos" (arXiv:2510.21571, 24 Oct 2025), Section 1 (Introduction), task alignment paragraph.