EgoScaler: Egocentric Video to 6DoF Trajectories
- EgoScaler is a framework that converts raw egocentric videos and natural language instructions into explicit 6DoF object trajectories for Vision-Language-Action pre-training.
- It employs a four-stage pipeline—object localization, segmentation/tracking, 3D reconstruction, and rotation estimation—with automatic curation to ensure data quality.
- Evaluations in simulation and on real robots show that pre-training with EgoScaler improves manipulation success and competes with expert teleoperation datasets.
EgoScaler is a fully automatic framework for converting raw egocentric video clips plus natural-language action descriptions into explicit 6DoF object manipulation trajectories suitable for Vision-Language-Action (VLA) pre-training. It was introduced in "Developing Vision-Language-Action Model from Egocentric Videos" and is motivated by the mismatch between the scale of available egocentric video and the cost of real-robot teleoperation datasets. In the reported formulation, EgoScaler extracts object-centric trajectories without auxiliary recordings such as detailed hand-pose sensors or motion-capture systems, applies automatic refinement to noisy trajectories, and is then used to construct a large-scale pre-training corpus for a VLA evaluated in simulation and on a real robot (Yoshida et al., 26 Sep 2025).
1. Problem setting and motivation
The central problem addressed by EgoScaler is data scarcity for VLA pre-training. The paper situates existing large VLA datasets in human teleoperation or expert demonstrations on real robots, citing examples such as OXE, Fractal, and BridgeData. Those data sources require costly hardware setups, expert time, and manual labeling, and the exposition states that scaling beyond tens of thousands of episodes is prohibitively expensive (Yoshida et al., 26 Sep 2025).
Egocentric video offers a different regime. First-person recordings from devices such as Quest 3, Aria, and Vision Pro, as well as public benchmarks including Ego4D, EPIC-KITCHENS, HD-EPIC, and Nymeria, capture hand-object interaction at close range and are rapidly growing. The paper characterizes them as a source with orders-of-magnitude more coverage of daily tasks “in the wild.” The technical difficulty is that raw egocentric video lacks the structured supervision typically required for imitation learning: camera motion is noisy, there is no ground-truth 6DoF object pose or hand pose, intrinsic and extrinsic calibration are not assumed, and start/end timestamps and action labels are not generally available unless manually annotated (Yoshida et al., 26 Sep 2025).
EgoScaler is therefore framed as an answer to a specific question: whether VLAs can be trained directly from raw egocentric videos rather than from teleoperation traces or from egocentric data augmented with auxiliary recordings. A key implication of the work is that the bottleneck is not only the availability of human manipulation footage, but the ability to convert that footage into an action representation compatible with robot policy learning.
2. Core pipeline and trajectory representation
EgoScaler takes as input a raw RGB clip and an instruction, for example “pick up the carrot and place it in the bowl,” and produces an end-effector-style trajectory: a per-frame sequence of 6DoF object poses.
The pipeline is described as having four main modules (Yoshida et al., 26 Sep 2025).
First, temporal and object localization identifies the relevant segment of the clip and the manipulated object. GPT-4o is used to predict the start and end frames and the noun phrase corresponding to the manipulated object. This is a notable design choice: EgoScaler does not assume pre-segmented manipulation clips or pre-given target objects.
Second, segmentation and pixel tracking isolate the target object across frames. The framework applies Grounding DINO as an open-vocabulary detector and SAM as a mask refiner, then seeds SpatialTracker on the mask to obtain 2D feature tracks. The resulting representation is not yet geometric action; it is a tracked image-plane description of the object over time.
Third, EgoScaler reconstructs 3D position using monocular multi-view stereo, specifically COLMAP/SfM+MVS, to recover a local 3D point cloud of the segmented object at each frame. Each point cloud is then registered back into the camera frame by solving an alignment problem formulated with the pinhole projection. The output of this stage is a sequence of 3D centroids, one per frame (Yoshida et al., 26 Sep 2025).
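As a rough illustration of the quantities involved in this stage, the sketch below projects camera-frame 3D points with a standard pinhole model and computes a per-frame centroid. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for illustration; the source does not say how intrinsics are obtained, and this is not the paper's registration procedure.

```python
import numpy as np

def project_pinhole(points, fx, fy, cx, cy):
    """Project (N, 3) camera-frame points to (N, 2) pixel coordinates.

    fx, fy, cx, cy are hypothetical intrinsics (focal lengths and
    principal point, in pixels). Points must have nonzero depth z.
    """
    points = np.asarray(points, dtype=float)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

def centroid(points):
    """Mean 3D position of an (N, 3) object point cloud."""
    return np.asarray(points, dtype=float).mean(axis=0)
```

Applying `centroid` to the object point cloud of every frame would give a `(T, 3)` position sequence of the kind this stage is described as producing.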
Fourth, rotation estimation computes the relative rotation between consecutive point clouds using the Kabsch solution, the SVD-based least-squares rotation between two corresponding point sets. The recovered orientation can be expressed as roll-pitch-yaw or in an alternative rotation parameterization. In the dataset eventually used for pre-training, each time step pairs a position with an orientation, which aligns the extracted egocentric motion with continuous robot action targets (Yoshida et al., 26 Sep 2025).
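The Kabsch step can be sketched in a few lines of NumPy. This is the textbook algorithm, not code from the paper, and it assumes the two point clouds are given as row-wise corresponding points (for instance, the same tracked points in two consecutive frames); the function name is hypothetical.

```python
import numpy as np

def kabsch_rotation(P, Q):
    """Return the 3x3 rotation R minimizing sum ||R p_i - q_i||^2.

    P and Q are (N, 3) arrays whose rows are assumed to correspond.
    Both clouds are centered first, so translation is factored out.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                      # 3x3 cross-covariance, sum of p_i q_i^T
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
```

Chaining the relative rotations frame by frame would yield an orientation track; converting each matrix to roll-pitch-yaw is a separate step not shown here.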
A common misconception is that “raw egocentric video” implies direct end-to-end policy learning from unstructured pixels alone. EgoScaler does not make that move. It inserts an explicit geometric reconstruction layer between video and policy training, and its output is an object-centric trajectory rather than a latent action code.
3. Refinement, filtering, and dataset construction
Because the pipeline operates on unconstrained first-person video, automatic curation is central rather than auxiliary. The paper specifies four refinement steps (Yoshida et al., 26 Sep 2025).
The first is a travel distance filter. EgoScaler computes the total distance traveled by the object along its trajectory and discards trajectories whose travel distance exceeds a threshold. The stated rationale is that large travel distance often indicates registration outliers.
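A minimal sketch of such a filter follows, assuming the travel distance is the summed Euclidean displacement between consecutive 3D centroids. That definition, the function names, and the `max_distance` parameter are assumptions for illustration; the paper's exact formula and threshold value are not reproduced here.

```python
import numpy as np

def travel_distance(centroids):
    """Total path length of a (T, 3) sequence of 3D centroids."""
    centroids = np.asarray(centroids, dtype=float)
    steps = np.diff(centroids, axis=0)          # (T-1, 3) displacements
    return float(np.linalg.norm(steps, axis=1).sum())

def passes_travel_filter(centroids, max_distance):
    """Keep a trajectory only if its path length is within max_distance.

    max_distance is a hypothetical threshold in the same units as
    the centroids.
    """
    return travel_distance(centroids) <= max_distance
```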
The second is the background track similarity (BGTS) filter. Object-mask centroids and background-region centroids are tracked, a velocity is defined for each from the frame-to-frame centroid motion, and a similarity score between the object and background velocities is computed. Trajectories are discarded when the score indicates excessive similarity under an empirically chosen threshold. This operationalizes a simple failure mode: if the object moves too similarly to the background, apparent motion may be dominated by camera motion rather than manipulation.
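One way to realize a score of this kind is the mean cosine similarity between object and background velocities. The sketch below assumes that form, finite-difference velocities, and a "discard if above threshold" rule; none of these details, nor the names `bgts_score` and `passes_bgts_filter`, are confirmed by the source.

```python
import numpy as np

def bgts_score(obj_centroids, bg_centroids, eps=1e-8):
    """Mean cosine similarity between object and background velocities.

    Both inputs are (T, D) centroid tracks; velocities are taken as
    frame-to-frame differences (an assumption for this sketch).
    """
    v_obj = np.diff(np.asarray(obj_centroids, dtype=float), axis=0)
    v_bg = np.diff(np.asarray(bg_centroids, dtype=float), axis=0)
    dots = np.sum(v_obj * v_bg, axis=1)
    norms = np.linalg.norm(v_obj, axis=1) * np.linalg.norm(v_bg, axis=1)
    return float(np.mean(dots / (norms + eps)))   # eps guards zero motion

def passes_bgts_filter(obj_centroids, bg_centroids, threshold):
    """Keep the trajectory when object motion is not background-like."""
    return bgts_score(obj_centroids, bg_centroids) <= threshold
```

With this form, a score near 1 means the object and background move together frame after frame, which is the camera-motion-dominated case the filter is meant to reject.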
The third step is smoothing, implemented as a five-frame moving average over 3D positions to suppress depth jitter. The fourth is a length threshold that discards clips shorter than a minimum or longer than a maximum number of frames in order to keep temporal scale consistent (Yoshida et al., 26 Sep 2025).
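The smoothing and length steps can be sketched as follows. The edge-padding choice (so the output keeps the input length) and the `min_frames` and `max_frames` parameters are assumptions for illustration; the source specifies only the five-frame window.

```python
import numpy as np

def smooth_positions(positions, window=5):
    """Centered moving average over a (T, 3) position track.

    The track is edge-padded so the output has the same length as
    the input; window is assumed to be odd.
    """
    positions = np.asarray(positions, dtype=float)
    pad = window // 2
    padded = np.pad(positions, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    cols = [np.convolve(padded[:, k], kernel, mode="valid")
            for k in range(positions.shape[1])]
    return np.stack(cols, axis=1)

def keep_by_length(num_frames, min_frames, max_frames):
    """Length filter; both bounds are hypothetical parameters."""
    return min_frames <= num_frames <= max_frames
```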
These procedures were applied to four large egocentric collections: Ego4D, Ego-Exo4D, HD-EPIC, and Nymeria. The reported dataset statistics are as follows.
| Quantity | Value | Description |
|---|---|---|
| Initial episodes | 124,559 clips | Before auto-filtering |
| Usable trajectories | 45,157 | After auto-filtering |
| Coverage | 313 verbs, 1,217 nouns | Distinct classes/categories |
The paper describes this resulting corpus as a new large-scale dataset for VLA pre-training (Yoshida et al., 26 Sep 2025). A plausible implication is that the main contribution is not just a trajectory extractor, but a conversion pipeline that turns heterogeneous egocentric video archives into a policy-ready action dataset.
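For a sense of how aggressive the automatic curation is, the retention rate follows directly from the two episode counts reported in the table:

```latex
\[
\text{retention} \;=\; \frac{45{,}157}{124{,}559} \;\approx\; 0.363
\]
```

In other words, roughly 36% of the initial clips survive filtering.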
4. VLA pre-training formulation
Each episode is formatted as a sequence of steps. Each step pairs a vision input, the tokenized instruction as the language input, a proprioceptive input, and an action target (Yoshida et al., 26 Sep 2025).
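A hypothetical container for such steps is sketched below. The field names, the choice of the current pose as the proprioceptive input, and the choice of the next pose as the action target are all assumptions for illustration, not the paper's specification.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    image: np.ndarray      # RGB frame at this step, shape (H, W, 3)
    instruction: str       # natural-language instruction, tokenized downstream
    proprio: np.ndarray    # ASSUMED: pose at the current step
    action: np.ndarray     # ASSUMED: pose at the next step

def build_episode(frames, instruction, poses):
    """Pair each frame with its pose and the following pose as target."""
    frames = np.asarray(frames)
    poses = np.asarray(poses, dtype=float)
    if len(frames) != len(poses):
        raise ValueError("frames and poses must have the same length")
    return [Step(frames[t], instruction, poses[t], poses[t + 1])
            for t in range(len(poses) - 1)]
```

Under these assumptions, a clip of `T` frames yields `T - 1` training steps, since the last frame has no following pose to serve as a target.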
The policy architecture is described in the exposition as using a frozen VLM backbone, PaliGemma, together with a flow-matching diffusion head for continuous action generation. The pre-training objective combines an imitation loss over trajectory targets with an optional language-trajectory alignment term.
This design locates EgoScaler within explicit-action VLA training rather than latent-only pre-training (Yoshida et al., 26 Sep 2025).
That distinction matters in the comparison with LAPA. The baseline labeled LAPA is described as latent-action pre-training on the same EgoScaler clips or on Something-Something V2. The empirical contrast presented by the paper suggests that explicit recovered trajectories from egocentric video can be more useful than latent action abstractions when the downstream target is robot manipulation, at least in the reported setup.
5. Evaluation and empirical findings
The experimental program includes both simulation and real-robot evaluation. In simulation, the benchmark is SIMPLER BridgeData V2 with four pick-and-place pairs; post-training uses a fixed set of successful rollouts per task collected from a finetuned model, and evaluation runs a fixed number of attempts per task. In the real-robot setting, the platform is an ALOHA bimanual system running four language-conditioned pick-and-place tasks involving carrots or onions placed into pots or bowls, with the objects present simultaneously on a table. The real-robot benchmark uses manually collected episodes for each task and evaluates multiple trials per task, with separate scores assigned for a correct grasp and for a correct placement (Yoshida et al., 26 Sep 2025).
The paper reports three main quantitative findings. First, pre-training on the EgoScaler dataset improves over training from scratch, with higher success on the real robot and an absolute gain in simulation. Second, performance is competitive with pre-training on real-robot datasets: EgoScaler alone is compared against plain pre-training on Fractal, BC-Z, and BridgeData V2. Third, combining the egocentric dataset with robot data yields the strongest reported result, obtained with EgoScaler + BridgeData V2 (Yoshida et al., 26 Sep 2025).
The ablations emphasize that dataset quality controls downstream performance. Increasing the number of pre-training episodes boosts real-robot success, although simulation performance does not follow the same monotonic trend, which the exposition attributes to visual noise. Similarly, BGTS threshold selection trades dataset size against quality: of the thresholds compared, one retains more episodes but gives lower real-robot success, and the reported optimum balances episode count against trajectory quality (Yoshida et al., 26 Sep 2025).
A recurring misunderstanding is that egocentric video pre-training obviates robot data. The evidence reported here does not support that interpretation. Instead, EgoScaler is presented as competitive with expert-teleoperation datasets and complementary to them, with the best result obtained by combining egocentric and real-robot sources.
6. Limitations, scope, and relation to EgoScale
The stated scope of EgoScaler is single-object pick-and-place. The paper explicitly identifies several limitations: extending to multi-object settings, articulated objects, or tool use would require richer segmentation and stronger occlusion handling; no gripper or finger state is recovered; and bridging the visual domain gap between egocentric real videos and simulator environments remains open. It also suggests future use of multi-modal egocentric streams such as IMU, audio, and gaze for improved pose estimation and action understanding (Yoshida et al., 26 Sep 2025).
These limitations clarify what EgoScaler is and is not. It is not a full human-hand retargeting system, because it does not reconstruct gripper or finger state. It is not a general dexterous manipulation framework in the sense of high-DoF hand control from human pose labels. Rather, it is an object-trajectory extraction and pre-training pipeline targeted at VLA learning from raw egocentric video.
This distinction becomes particularly important in relation to the later framework EgoScale. EgoScale is a separate human-to-dexterous-manipulation transfer system trained on a large corpus of action-labeled egocentric human video, with per-frame labels derived from camera SLAM and a hand-pose estimator, and a two-stage recipe consisting of large-scale human pretraining followed by aligned human-robot mid-training (Zheng et al., 18 Feb 2026). Whereas EgoScaler emphasizes extraction of explicit 6DoF object trajectories from raw videos without auxiliary recordings, EgoScale emphasizes dexterous transfer with relative wrist pose deltas and retargeted high-DoF hand joint angles, and reports a log-linear scaling law between human data scale and validation loss (Zheng et al., 18 Feb 2026).
Taken together, the two frameworks define adjacent but distinct points in the design space. EgoScaler suggests that raw egocentric videos can be converted into explicit trajectories sufficient for pre-training a state-of-the-art VLA without motion-capture or hand-pose sensors (Yoshida et al., 26 Sep 2025). EgoScale suggests that, when action labels and hand-pose processing are available at very large scale, egocentric human data can also support dexterous, embodiment-agnostic transfer (Zheng et al., 18 Feb 2026). A plausible implication is that future work may combine EgoScaler-style automatic trajectory induction with EgoScale-style alignment and dexterous retargeting to broaden the action vocabulary beyond single-object pick-and-place.