VideoAgentTrek: Scalable Pretraining Pipeline
- VideoAgentTrek is a scalable pretraining pipeline that extracts structured GUI interactions from unlabeled screen-recorded videos for training multimodal agents.
 - It employs dense, prompt-free event segmentation, action parameterization, and inner monologue generation to create ReAct-style trajectories.
 - The approach minimizes manual annotation costs while significantly boosting benchmark performance in complex, long-horizon tasks.
 
VideoAgentTrek is a scalable pretraining pipeline for computer-use agents that autonomously extracts large-scale, structured GUI interaction trajectories from unlabeled, publicly available screen-recorded videos. It addresses the prohibitive cost of manual annotation for interface-centric agent training by transforming raw, passive video demonstrations into explicit, action-parameterized supervision suitable for downstream fine-tuning of multimodal AI agents (Lu et al., 22 Oct 2025).
1. Pipeline Overview and Motivation
VideoAgentTrek is designed around the insight that internet tutorial videos, particularly large quantities of screen-recorded demonstrations, contain latent but untapped sequences of interface actions. However, these videos lack explicit stepwise labeling (e.g., click/tap coordinates, typed inputs), necessitating automated, high-fidelity event extraction. The system’s pipeline converts implicit, “shown” demonstrations into ReAct-style multimodal trajectories, capturing not only visual context and action parameters but also generating an inner monologue that reflects local intent and reasoning.
The pipeline avoids expensive manual human annotation by employing trained models to perform step segmentation, action parameterization, and paraphrased reasoning, thereby unlocking billions of potential training examples at web scale and demonstrably narrowing the gap between human-annotated and passively recorded training corpora.
2. Video2Action: Inverse Dynamics Extraction
Central to the pipeline is Video2Action, an inverse dynamics module (IDM) that decomposes raw video into stepwise, parameterized supervision suitable for agent training. Video2Action consists of:
- Video Grounding Model: This model performs dense event detection over the video sequence, identifying GUI atomic actions (click, drag, scroll, keyboard entry, etc.) along with their start and end timestamps. Formally, for an input clip $V$ of length $T$:

$$\mathcal{G}(V) = \{(c_i,\, t_i^{\text{start}},\, t_i^{\text{end}})\}_{i=1}^{N},$$

where $c_i$ is the category of the $i$-th detected event and $N$ is the number of events found in the clip.
This dense, prompt-free event detection ensures that even temporally local interface behaviors are recovered without manual cues.
- Action-Content Recognizer: For each detected action segment $[t_i^{\text{start}}, t_i^{\text{end}}]$, a parameterization head maps the action to its structured type and arguments (such as screen coordinates, typed strings):

$$\mathcal{R}\big(V_{[t_i^{\text{start}},\, t_i^{\text{end}}]}\big) = (a_i,\, p_i).$$
This structured extraction supports downstream training that requires precise spatial/semantic mapping from GUI events to agent action spaces. The action-content recognizer is tailored for high fidelity, focusing on spatial precision and text entry accuracy.
- Inner Monologue Generation: For each step, a short intent monologue is produced by paraphrasing the action and its effect (via models such as GPT-5 Medium), taking as input the step metadata, screenshots, and the associated ASR text window. This additional narrative layer provides fine-grained context for hierarchical planning and error recovery.
 
The output is a ReAct-style trajectory record:

$$\tau = \big\{(o_t^{\text{pre}},\, o_t^{\text{post}},\, m_t,\, a_t,\, p_t)\big\}_{t=1}^{T},$$

where $o_t^{\text{pre}}$ and $o_t^{\text{post}}$ are screenshot observations before and after the action, $m_t$ is the local intent, $a_t$ is the action type, and $p_t$ encodes the action parameters.
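To make the record concrete, a trajectory step can be represented as a simple data structure. The following Python sketch is illustrative only; the field names and types are assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TrajectoryStep:
    """One ReAct-style step mined from a tutorial video (illustrative schema)."""
    screenshot_pre: str                            # frame captured just before the action (o_t^pre)
    screenshot_post: str                           # frame captured just after the action (o_t^post)
    inner_monologue: str                           # short paraphrased intent m_t, e.g. "Open the File menu"
    action_type: str                               # a_t, e.g. "click", "drag", "scroll", "type"
    coordinates: Optional[Tuple[int, int]] = None  # part of p_t: screen position for pointer actions
    text: Optional[str] = None                     # part of p_t: typed string for keyboard actions
    t_start: float = 0.0                           # event start time in the source video (seconds)
    t_end: float = 0.0                             # event end time (seconds)

# A full trajectory is an ordered list of such steps:
Trajectory = List[TrajectoryStep]
```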
3. Data Collection, Filtering, and Dataset Composition
Data mining in VideoAgentTrek proceeds in two main phases:
- Seed and Channel Expansion: Initial queries (e.g., “Excel tutorial”) yield a set of seed videos and channels. If a channel contains >80% GUI-focused videos, all its uploads are included, rapidly expanding the candidate set. This strategy builds a pool exceeding 55,000 videos (10k+ hours).
- Automatic Screen-based Filtering: To ensure trajectory quality, ScreenFilter is applied. This YOLOv8x-based cursor detector retains only segments in which the cursor is visible in at least 80% of frames (minimum interval: 6 s), filtering out irrelevant or camera-facing videos; a minimal sketch of this heuristic follows below. After filtering, the pipeline retains 7,377 hours of high-confidence on-screen GUI interactions.
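The sketch below assumes per-frame cursor detections have already been produced by the YOLOv8x detector; the greedy windowing and the function interface are illustrative, not the pipeline's actual implementation:

```python
from typing import List, Sequence, Tuple

def screen_filter(
    frame_times: Sequence[float],     # timestamp (seconds) of each sampled frame
    cursor_visible: Sequence[bool],   # per-frame cursor detections from a YOLOv8x model
    min_visible_ratio: float = 0.80,  # keep windows with >= 80% cursor-visible frames
    min_interval_s: float = 6.0,      # discard windows shorter than 6 seconds
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals of high-confidence on-screen GUI interaction."""
    kept: List[Tuple[float, float]] = []
    i, n = 0, len(frame_times)
    while i < n:
        # Greedily grow a window while the running cursor-visibility ratio stays above threshold.
        visible, j = 0, i
        while j < n:
            visible += int(cursor_visible[j])
            if visible / (j - i + 1) < min_visible_ratio:
                break
            j += 1
        start = frame_times[i]
        end = frame_times[j - 1] if j > i else start
        if j > i and end - start >= min_interval_s:
            kept.append((start, end))
        i = max(j, i + 1)  # resume after the accepted window, or skip a frame that failed immediately
    return kept
```

Windows rejected by this check correspond to camera-facing or otherwise non-GUI footage and are dropped before event extraction.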
 
Further refinement leverages video metadata (titles, tags, descriptions) to confirm that ~70% of the content is structured as stepwise tutorials, maximizing the opportunity for meaningful action extraction.
The processed dataset encompasses 39,000 unique videos, yielding approximately 1.52 million extracted action steps, with an average trajectory length of 39 steps. This step density enables modeling of long-horizon, realistic use cases for complex interfaces.
4. Training Protocols and Performance Impact
The extracted data is used in a two-stage downstream training protocol:
- Continued Pretraining: A vision-language model such as Qwen2.5-VL-7B-Instruct is further trained on the extracted trajectories, leveraging the large data volume (tens of billions of tokens) and its multimodal structure. Distributed training uses standard hardware (up to 64 H100 GPUs, a global batch size of roughly 1,200, and multiple epochs over the 1.52 million interaction steps).
- Supervised Fine-Tuning: Following pretraining, supervised fine-tuning is performed on a smaller, human-annotated set to enhance robustness and accuracy. The two stages are summarized in the configuration sketch below.
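The configuration below simply restates the figures from this section for reference; hyperparameters not reported here (e.g., learning rate) are left as placeholders rather than guessed:

```python
# Illustrative two-stage training configuration (values taken from the text where given;
# unreported hyperparameters are placeholders, not figures from the paper).
training_recipe = {
    "stage_1_continued_pretraining": {
        "base_model": "Qwen2.5-VL-7B-Instruct",
        "data": "VideoAgentTrek trajectories (~1.52M interaction steps, tens of billions of tokens)",
        "hardware": "up to 64x NVIDIA H100 GPUs",
        "global_batch_size": 1200,   # approximate, as reported
        "epochs": "multiple passes over the extracted steps",
        "learning_rate": None,       # not specified in this summary
    },
    "stage_2_supervised_fine_tuning": {
        "data": "smaller human-annotated trajectory set",
        "objective": "improve robustness and step-level accuracy",
    },
}
```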
 
Performance is evaluated on two public benchmarks:
| Benchmark | SFT-only Baseline | VideoAgentTrek Pretraining + SFT |
|---|---|---|
| OSWorld-Verified (task success rate) | 9.3% | 15.8% |
| AgentNetBench (step accuracy) | 64.1% | 69.3% |
On OSWorld-Verified (Ubuntu desktop tasks), pretraining yields a roughly 70% relative improvement ((15.8 - 9.3) / 9.3 ≈ 0.70). AgentNetBench also reflects consistent gains across a variety of tasks. This suggests that the extracted data significantly improves planning, error recovery, and robustness, particularly in dynamic and long-horizon tasks.
5. Scalability, Automation, and Data Fidelity
A defining feature of VideoAgentTrek is its web-scale, automated processing:
- Automation: Minimal manual input is required beyond initial seed and channel selection. Key steps, from video harvesting to cursor-based filtering, event segmentation, parameter extraction, and monologue generation, are fully automated. Periodic human checks ensure dataset quality.
 - Scalability: The pipeline can be extended to new domains, operating system variants, and interface styles by updating filtering heuristics or event recognizers. Its throughput supports the ingestion and processing of tens of thousands of hours of tutorial content with modest resources.
 
This high degree of automation and scalability implies that the approach is not tethered to a particular platform or task distribution, and supports updates as new task domains (for example, productivity software, creative tools) emerge on public video platforms.
6. Technical Insights and Methodological Details
Some critical technical details include:
- Dense, Prompt-Free Action Segmentation: The event detector operates without external prompts, using learned visual and temporal cues to demarcate GUI action boundaries, formalized as:

$$\mathcal{G}(V) = \{(c_i,\, t_i^{\text{start}},\, t_i^{\text{end}})\}_{i=1}^{N}.$$
- Action Parameterization: Parameter extraction is formalized as:

$$\mathcal{R}\big(V_{[t_i^{\text{start}},\, t_i^{\text{end}}]}\big) = (a_i,\, p_i),$$

where $a_i$ is the structured action type and $p_i$ may encode, for example, screen coordinates for clicks, bounding boxes, or typed strings.
- Monologue Generation: For each step, the intent $m_t$ is produced by fusing observed modalities and ASR context into a concise natural-language statement (e.g., "Open the File menu to begin importing data"), enhancing downstream language modeling of intent and reasoning; a minimal sketch follows this list.
- Continued Pretraining Configurations: Training uses large batch sizes, mixed precision, and epoch counts tuned for stability, adapting hyperparameters (e.g., learning-rate schedules) established for large vision-language models in similar regimes.
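A minimal sketch of the monologue-generation step referenced above, assuming a generic `caption_model` wrapper around the captioning VLM; the prompt wording and function signature are assumptions for illustration, not the pipeline's actual implementation:

```python
from typing import Callable, Sequence

def generate_inner_monologue(
    action_type: str,                    # e.g. "click"
    action_params: dict,                 # e.g. {"x": 412, "y": 88, "target": "File menu"}
    asr_window: str,                     # ASR transcript text around the action timestamp
    screenshots: Sequence[bytes],        # pre/post frames for this step
    caption_model: Callable[..., str],   # hypothetical wrapper around a VLM such as GPT-5 Medium
) -> str:
    """Paraphrase one detected step into a concise natural-language intent statement (m_t)."""
    prompt = (
        "You are narrating a GUI tutorial step. Given the action, its parameters, and the "
        "narrator's speech, write one concise sentence describing the intent of this step.\n"
        f"Action: {action_type} {action_params}\n"
        f"Narration: {asr_window}"
    )
    # Images are passed alongside the text prompt; the wrapper hides the provider-specific API.
    return caption_model(prompt=prompt, images=list(screenshots))
```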
 
The pipeline maintains a ReAct-style multimodal trajectory structure:

$$\tau = \big\{(o_t^{\text{pre}},\, o_t^{\text{post}},\, m_t,\, a_t,\, p_t)\big\}_{t=1}^{T},$$

which encapsulates screenshots before and after the action, step-wise intent monologues, action types, and parameterizations for structured agent learning.
7. Significance and Implications
VideoAgentTrek demonstrates that passive internet screen-recorded tutorials constitute a rich, underexploited resource for pretraining multimodal agents. By introducing automated, dense event detection, high-fidelity parameter extraction, and intent paraphrasing, it closes the gap between hand-annotated interactive datasets and abundant raw video corpora.
A plausible implication is that this approach can generalize to other human-computer interaction domains (e.g., mobile apps, web navigation, code editors) where labeled interaction data is scarce. It also provides a foundation for scaling interactive agent research, as its output has been empirically shown to improve both online and offline benchmark performance in desktop automation, demonstrating strong transfer and robustness absent in smaller, manually curated datasets.
The methodology sets a practical precedent for future pipelines aimed at leveraging large-scale, passively collected multimodal data in agent learning, reinforcing the utility of automated event mining and weakly supervised trajectory parsing within real-world AI agent systems.