Video2Action: Automated Action Extraction

Updated 24 October 2025
  • Video2Action is a framework that converts raw video into discrete, annotated action sequences, reducing manual labeling efforts.
  • It integrates techniques ranging from SSIM-based segmentation and deep tap-location prediction to inverse dynamics modeling and diffusion-based synthesis.
  • Experimental results demonstrate enhanced annotation speed, improved agent training metrics, and robust cross-domain action transfer.

Video2Action encompasses a family of methodologies and systems aimed at extracting, detecting, annotating, or transferring human or agent actions from video data. These approaches span mobile app interaction annotation, automated acquisition of training data for computer-use agents, GUI action extraction, and flexible action transfer across heterogeneous scenarios. Methods range from lightweight computer vision and deep learning pipelines to sophisticated inverse dynamics modeling and video-action-conditioned diffusion. The goal is often to reduce manual annotation effort, automate the construction of action trajectories, or enable new generative control capabilities based on observed actions.

1. General Definition and Conceptual Framing

Video2Action, as formalized across recent literature, refers to automated approaches that transform raw video (typically screen recordings, user tutorials, or reference video demonstrations) into sequences of discrete action events, action-parameter tuples, or annotated event boundaries. This transformation may involve detecting “action scenes” (distinct UI transitions), localizing spatial and temporal action parameters (e.g., tap locations, pointer coordinates), or constructing new videos exhibiting transferred actions. The common objective is to operationalize human or system activities captured in video as structured actions—facilitating downstream tasks such as supervised agent training, data indexing, video annotation, or generative synthesis (Feng et al., 2023, Lu et al., 22 Oct 2025, Zhang et al., 6 May 2025).

Key attributes include:

  • Input: Unlabeled or weakly-labeled video (e.g., app tutorials, GUI screen recordings, human demonstration videos).
  • Output: Structured action labels (what), parameters (where, how), temporal boundaries (when), and possibly auxiliary intent signals (why, or an inner monologue); a schematic record is sketched after this list.
  • Motivation: Automation of annotation, large-scale data mining for agent training, or enabling flexible action transfer in content generation.
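
As a concrete illustration of this input/output contract, the following is a minimal Python sketch of a structured action record; the field names are illustrative assumptions, not a schema taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ActionEvent:
    """One structured action recovered from a video (illustrative schema)."""
    action_type: str                                 # "what": e.g. TAP, SCROLL, TYPE, BACKWARD
    t_start: float                                   # "when": segment start, in seconds
    t_end: float                                     # "when": segment end, in seconds
    location: Optional[Tuple[float, float]] = None   # "where": normalized (x, y), if applicable
    text: Optional[str] = None                       # "how": typed content for keyboard actions
    intent: Optional[str] = None                     # "why": optional inner-monologue / intent signal


# Example: a tap detected at 12.4-12.6 s near the centre of the screen.
tap = ActionEvent(action_type="TAP", t_start=12.4, t_end=12.6, location=(0.48, 0.71))
```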

2. Methodological Approaches

2.1. Heuristic and Deep-Learning Pipelines for UI Action Extraction

Video2Action pipelines designed for annotating app tutorial videos (Feng et al., 2023) utilize a two-phase structure:

  1. Action Scene Generation:
    • Segmentation of a video into “action scenes” using perceptual image-processing methods. Key techniques include computing frame-to-frame luminance differences in YUV color space and employing the Structural Similarity Index (SSIM) for fine-grained detection of UI transitions; a minimal SSIM-based sketch follows this list.
    • Recognition of certain patterns (e.g., sudden drops or palindromic similarity) aids in distinguishing TAP, SCROLL, and BACKWARD actions.
  2. Action Location Prediction:
    • Prediction of on-screen action coordinates (particularly for TAP actions) using a deep network composed of a ResNet-101 visual backbone, a region proposal network (RPN) with multi-scale/arbitrary aspect anchors, and a location prediction network performing both classification and regression.
    • The regression objective is formalized as $\text{Loss}_{\text{reg-}x} = \mathbb{1}_{x \notin [x_{\text{lower}}, x_{\text{upper}}]} \cdot \mathrm{smooth}_{L_1}\!\left(x - \frac{x_{\text{lower}} + x_{\text{upper}}}{2}\right)$, so a prediction is penalized only when it falls outside the ground-truth element's interval; a PyTorch sketch of this loss also follows the list.
    • Robustness is enhanced via UI-specific data augmentation (element exchange among perceptual groups, metamorphic generation through reversible transitions), followed by DBSCAN-based clustering to consolidate redundant taps.
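
A minimal sketch of the first phase, under simplifying assumptions: frame-to-frame SSIM is computed on grayscale frames, and sharp similarity drops are taken as candidate action-scene boundaries. The luminance-difference pass, palindromic-pattern matching, and action-type heuristics of the full pipeline are omitted, and the 0.85 threshold is an illustrative value, not one reported in the paper.

```python
# Sketch of SSIM-based action-scene boundary detection (simplified).
# Assumes `frames` is a list of same-sized grayscale uint8 arrays.
from skimage.metrics import structural_similarity as ssim


def detect_scene_boundaries(frames, threshold=0.85):
    """Return indices of frames whose SSIM to the previous frame drops
    below `threshold`, i.e. candidate starts of new action scenes."""
    boundaries = []
    for i in range(1, len(frames)):
        score = ssim(frames[i - 1], frames[i])
        if score < threshold:          # large visual change -> likely UI transition
            boundaries.append(i)
    return boundaries
```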

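The interval-aware regression term above can be written directly in PyTorch. This is a minimal sketch assuming scalar normalized coordinates and per-sample interval bounds; the actual training code, anchor handling, and classification branch are not shown.

```python
import torch
import torch.nn.functional as F


def tap_location_loss(x_pred, x_lower, x_upper):
    """Smooth-L1 penalty applied only when the predicted coordinate falls
    outside the ground-truth element interval [x_lower, x_upper].
    All tensors share the same shape (one coordinate per sample)."""
    center = (x_lower + x_upper) / 2.0
    outside = (x_pred < x_lower) | (x_pred > x_upper)        # 1{x not in [lower, upper]}
    per_sample = F.smooth_l1_loss(x_pred, center, reduction="none")
    return (outside.float() * per_sample).mean()


# Example: only the second prediction lies outside its interval and is penalized.
x_pred  = torch.tensor([0.30, 0.55, 0.90])
x_lower = torch.tensor([0.25, 0.60, 0.70])
x_upper = torch.tensor([0.45, 0.80, 0.95])
loss = tap_location_loss(x_pred, x_lower, x_upper)
```
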
2.2. Inverse Dynamics Modeling for Internet-Scale GUI Action Extraction

Video2Action, instantiated as an inverse dynamics module (IDM) (Lu et al., 22 Oct 2025), powers large-scale data mining from raw screen-recorded videos:

  • Video Grounding Model:
    • Cast as a multi-class temporal event detection network, this component densely localizes GUI actions by predicting both action types ($a_k$) and high-fidelity temporal boundaries ($(t_k^s, t_k^e)$).
    • The mapping is $f_\theta(v) \to S = \{(a_k, t_k^s, t_k^e)\}_{k=1}^K$, with $v$ as the raw video.
  • Action-Content Recognizer:
    • For each detected segment $v_k$, predicts detailed action parameters such as coordinates or typed text: $h_\phi(v_k) \to (\hat{a}_k, \pi_k)$. A schematic of the two-stage interface is sketched after this list.
  • Operational Pipeline:
    • Automatic web crawling and screen-detection filters (using cursor detection, e.g., YOLOv8x) are used to select GUI-centric screen recordings.
    • The resulting action-parameterized trajectories are used for continued (self-supervised) pretraining (26B tokens) and subsequent supervised fine-tuning (8B tokens), producing measurable gains: task success rate on OSWorld-Verified increases from 9.3% (SFT-only) to 15.8%, and step accuracy on AgentNetBench from 64.1% to 69.3%.
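
A schematic of the two-stage interface described above, with hypothetical class and method names (grounding_model, content_recognizer, ground(), recognize(), clip()); the actual architectures and APIs used in the paper are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroundedAction:
    action_type: str     # coarse action type a_k, e.g. "click", "type", "scroll"
    t_start: float       # t_k^s, segment start in seconds
    t_end: float         # t_k^e, segment end in seconds


@dataclass
class ParameterizedAction(GroundedAction):
    coords: Optional[Tuple[float, float]] = None   # click/scroll target, if any
    text: Optional[str] = None                     # typed content, if any


class InverseDynamicsModule:
    """Hypothetical wrapper: f_theta grounds actions in time, h_phi fills in parameters."""

    def __init__(self, grounding_model, content_recognizer):
        self.f_theta = grounding_model     # video -> {(a_k, t_k^s, t_k^e)}
        self.h_phi = content_recognizer    # segment -> (refined action type, parameters)

    def extract(self, video) -> List[ParameterizedAction]:
        trajectory = []
        for act in self.f_theta.ground(video):              # stage 1: temporal grounding
            segment = video.clip(act.t_start, act.t_end)    # assumed clip() helper
            # stage 2: content recognition; `params` is assumed to be a dict
            # with optional "coords" / "text" keys.
            a_hat, params = self.h_phi.recognize(segment)
            trajectory.append(
                ParameterizedAction(a_hat, act.t_start, act.t_end, **params)
            )
        return trajectory
```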

2.3. Flexible Action Transfer and Video Synthesis

Recent generative approaches extend Video2Action to flexible action transfer, synthesizing new video content where a target subject (image) performs a reference action from an input video (Zhang et al., 6 May 2025):

  • RefAdapter (Reference-conditioned Adapter):
    • Integrates into a diffusion-based video generation model.
    • Replaces fixed first-frame conditioning with randomization, injecting the target image as the first-frame embedding for every training sample, thus supporting variations in subject layout or pose.
  • Frequency-Aware Action Extraction (FAE):
    • Dynamically modulates attention to frequency-aware embeddings during the diffusion denoising trajectory.
    • Early denoising timesteps focus on low-frequency (motion) signals, later timesteps on high-frequency (appearance) details:

    $W_{\text{attn}} = W_{\text{ori}} + W_{\text{bias}}$

    with the $W_{\text{bias}}$ schedule depending on the denoising timestep $t$; an illustrative schedule is sketched after this list.

  • Result:

    • Enables action transfer even across subjects with mismatched skeletal structure, viewpoint, or layout, while maintaining identity consistency.
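
The timestep-dependent bias in FAE can be illustrated with a minimal sketch. The linear ramp below is an assumption made for illustration; FlexiAct's actual schedule and the way the bias enters the attention computation are not reproduced here.

```python
def frequency_bias_weights(t, T):
    """Illustrative schedule: early (noisy) steps weight the low-frequency
    (motion) bias, later steps the high-frequency (appearance) bias.
    `t` runs from T at the start of denoising down to 0 at the end."""
    progress = t / T                 # 1.0 early in denoising, 0.0 at the end
    w_low = progress                 # motion-oriented bias dominates early
    w_high = 1.0 - progress          # appearance-oriented bias dominates late
    return w_low, w_high


def biased_attention_weights(w_ori, bias_low, bias_high, t, T):
    """W_attn = W_ori + W_bias, with W_bias mixed from frequency-aware terms."""
    w_low, w_high = frequency_bias_weights(t, T)
    w_bias = w_low * bias_low + w_high * bias_high
    return w_ori + w_bias
```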

3. Evaluation Protocols and Metrics

  • Temporal and Spatial Precision:
    • Precision@k for location prediction (e.g., Top-1: 50.14%, Top-3: 69.32%, Top-5: 81.89% for TAP location detection on Video2Action (Feng et al., 2023)).
    • Video F1-score and Levenshtein score for alignment of action scene generation with ground truth (e.g., 81.67% and 86.41%, respectively (Feng et al., 2023)); illustrative implementations of these metrics are sketched after this list.
  • Scale and Impact on Agent Training:
    • Dataset size: 1.52 million labeled interaction steps mined from 39,000 YouTube videos (VideoAgentTrek’s Video2Action IDM (Lu et al., 22 Oct 2025)).
    • Downstream metrics: substantial lifts in agent task success rate (from 9.3% to 15.8%) and step accuracy (from 64.1% to 69.3%) (Lu et al., 22 Oct 2025).
  • Generative Metrics:
    • Motion fidelity, temporal consistency, appearance consistency, and text similarity for evaluating flexible action transfer (FlexiAct (Zhang et al., 6 May 2025)).
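
For reference, minimal sketches of two of the metrics above, Precision@k and a normalized Levenshtein similarity over action-type sequences, are given below; the exact normalization and matching criteria used in the cited papers may differ.

```python
def precision_at_k(ranked_candidates, is_correct, k):
    """Fraction of samples whose correct location appears among the top-k
    ranked candidates. `ranked_candidates` is a list of candidate lists;
    `is_correct(candidate, sample_idx)` is a user-supplied check."""
    hits = sum(
        any(is_correct(cand, i) for cand in cands[:k])
        for i, cands in enumerate(ranked_candidates)
    )
    return hits / len(ranked_candidates)


def levenshtein_score(pred_actions, gt_actions):
    """Normalized edit-distance similarity between predicted and ground-truth
    action-type sequences (1.0 means identical)."""
    m, n = len(pred_actions), len(gt_actions)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred_actions[i - 1] == gt_actions[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return 1.0 - dp[m][n] / max(m, n, 1)


# Example: one spurious SCROLL in the prediction -> similarity 1 - 1/3 ≈ 0.67.
score = levenshtein_score(["TAP", "SCROLL", "TAP"], ["TAP", "TAP"])
```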

4. User Study and Practical Utility

In the case of app tutorial annotation (Feng et al., 2023):

  • Annotation Efficiency:
    • Users employing Video2Action annotated videos significantly faster than with manual annotation (e.g., 12 minutes vs. 22 minutes, with time savings of up to 85%).
    • Participants found the system reduced cognitive load and improved ease of locating and marking action events.
  • Interface:
    • Provides real-time feedback with video playback, automatic action thumbnails for navigation, and a ranked action location suggestion panel.

5. Broader Implications and Applications

  • Foundation for Large-Scale Computer-Use Agents:
    • Enables mining of passive internet video for training data, reducing the cost and manual effort of annotation at scale (Lu et al., 22 Oct 2025).
    • Supports robust agent pretraining for real-world graphical user interface (GUI) interaction tasks.
  • Accessibility and Workflow Enhancement:
    • Drastically accelerates annotation for video-based documentation, tutorial creation, or software onboarding content (Feng et al., 2023).
    • Can be integrated with video editing suites, automated testing frameworks, or bug reporting tools.
  • Flexible Action Synthesis:
    • Facilitates customizable video and animation production, surpassing traditional pose-retargeting constraints (e.g., FlexiAct (Zhang et al., 6 May 2025)).
    • Applicable in filmmaking, gaming, and augmented/virtual reality, supporting cross-domain and non-human motion transfer.
  • Potential for Generalization:
    • Future work includes extending to more complex gestures (pinch/rotate), augmenting with animation cues, adaptation to other platforms (e.g., iOS, desktop), and application in accessibility features (video captioning for impaired users) (Feng et al., 2023).

6. Technical Summary Table

| System/Paper | Core Methodology | Application Domain |
| --- | --- | --- |
| Video2Action (Feng et al., 2023) | SSIM-based UI transition + deep tap location regressor | App tutorial annotation, user interaction mining |
| Video2Action IDM (Lu et al., 22 Oct 2025) | Inverse dynamics (video grounding + content recognition) | Mining action trajectories for agent pretraining |
| FlexiAct (Zhang et al., 6 May 2025) | Diffusion-based action transfer (RefAdapter + FAE) | Cross-domain video synthesis, action retargeting |

Each approach applies Video2Action to distinct technical and application contexts, reflecting the breadth of research under this conceptual umbrella.

7. Future Directions

  • Increased Accuracy and Enrichment:
    • Integrating richer features such as animation cues and more detailed segmentation or clustering can improve both scene detection and parameter extraction (Feng et al., 2023).
    • Parameterizing composition and controlling augmentation processes more efficiently, perhaps in an end-to-end fashion, can further boost data efficiency (Zhang et al., 6 May 2025).
    • Feed-forward action transfer, more extensive cross-domain evaluation, and tighter integration into production systems are identified as open challenges.
  • Broader Scope:
    • Extending past dominant platforms (Android) to iOS, desktop, or cross-OS scenarios, as well as supporting AR/VR and complex multi-modal actions (Feng et al., 2023).
    • Exploiting passively observed behaviors for new agent paradigms, including intent modeling and more diverse human-computer interactions (Lu et al., 22 Oct 2025).

In summary, Video2Action encompasses a set of methodologies that extract, annotate, transfer, or synthesize action information from video data. Spanning annotation, automated mining, and generative transfer, these systems enable scalable training for computer-use agents, accessible annotation for content creators, and new generative paradigms in video-based action control.
