LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

Published 31 May 2026 in cs.RO | (2606.01458v1)

Abstract: Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos on every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Humanoid motion is recorded independently of scene appearance in LEGS, allowing the same auto-generated demonstrations to be re-rendered under new backgrounds and object meshes--covering a new scene at more than 15x lower cost than teleoperation--to augment training data for robustness to scene variations. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces a simulation framework that eliminates teleoperation by leveraging photorealistic 3D Gaussian Splatting for VLA fine-tuning in humanoid tasks.
It decouples physics and appearance, enabling rapid, operator-free data generation and re-rendering, significantly reducing labor costs.
Experimental results show that LEGS achieves state-of-the-art sim-to-real transfer with superior task success compared to traditional teleoperated methods.

LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

Introduction

Humanoid loco-manipulation requires both high-fidelity perception and robust control, yet current VLA fine-tuning protocols are fundamentally bottlenecked by the cost and inflexibility of teleoperated data collection. Traditional simulation-based approaches have been inadequate due to visual domain gaps, especially when mesh-only renderers are used. "LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World" (2606.01458) presents a simulation framework that leverages photorealistic rendering—via 3D Gaussian Splatting (3DGS)—to synthesize large-scale training data without reliance on teleoperation or real robot data collection.

Figure 1: LEGS produces photorealistic, procedurally generated humanoid loco-manipulation data, facilitating robust sim-to-real transfer for VLA models.

The LEGS Simulator Architecture

LEGS is predicated on a dual decoupling which is pivotal for its data transfer properties:

Physics–Rendering Decoupling: LEGS employs a MuJoCo backend for physically plausible motion and a separate visual frontend that composites mesh-based dynamic foregrounds over a scanned, photorealistic 3DGS background.
Motion–Appearance Decoupling: Demonstrations, generated procedurally, are independent of visual context, permitting the same motion traces to be re-rendered under arbitrary backgrounds and object appearances with trivial computational cost.
Figure 2: LEGS pipeline overview: scene and object asset creation, physical simulation, procedural demonstration generation, and final rendering with color calibration.

The visual frontend achieves deployment-level photorealism by reconstructing backgrounds with 3DGS optimized from handheld scans and meshes (objects, robot) extracted with SAM3D. A two-stage color calibration procedure precisely matches the composited observations’ statistics to the target robot camera, eliminating residual ISP-induced domain gaps.

Figure 3: Comparison of egocentric images: real on-robot view, mesh-only baseline, and calibrated LEGS renders. Re-rendering under new assets is trivial and low-cost.

Procedural Motion Generation and Dataset Scalability

Instead of operator-driven teleoperation or per-task human retargeting, LEGS utilizes a procedural motion primitive system. High-level commands (such as Walk, Pick, Place) are parameterized with task-level arguments (e.g., relative object transformations), and successful episodes are programmatically validated before inclusion, guaranteeing only feasible and successful trajectories are used for policy fine-tuning.

Crucially, decoupled dynamics and rendering allow each verified motion trajectory to be re-rendered with arbitrary swapped asset sets, enabling rapid generation of robust, scene-diverse training corpora.

Experimental Protocol and Task Suite

LEGS is evaluated on a Unitree G1 humanoid platform across three pick-and-place tasks:

Task 1: Tabletop manipulation without locomotion.
Task 2: Loco-manipulation with navigation to a table before pick-and-place.
Task 3: Full-body loco-manipulation featuring walking, object pick, turning, and squatting to reach a target.
Figure 4: Progressive increase in loco-manipulation challenge across three benchmark pick-and-place tasks on the Unitree G1.

For each task, policies based on three SOTA VLA backbones ( $\psi_0$ , $\pi_{0.5}$ , GR00T N1.6) are fine-tuned using four conditions: Teleop~(50), LEGS~(50), LEGS~(200), and a mesh-only synthetic baseline (SAM3D~(200)). All fine-tuning is performed with identical SONIC whole-body controllers, ensuring comparability.

Key Results

Teleoperation Elimination and Superior Data Efficiency

Policies fine-tuned solely on LEGS data (either (50) or (200)) consistently match or surpass those trained on teleoperated trajectories across all tested backbones and tasks. Notably, on the most difficult Task~3, Teleop~(50) yields zero successes for all backbones, while LEGS~(200) achieves 2–6/10, and LEGS~(50) achieves 1–5/10, demonstrating robustness under extreme embodied control.

Photorealism’s Crucial Role in Sim-to-Real Transfer

Ablation against the mesh-only baseline demonstrates that photorealistic rendering is essential: SAM3D~(200) achieves only 33% mean task completion, whereas LEGS~(200) reaches 67%. The performance gap emerges predominantly in visually demanding, close manipulation stages, as documented by stage-wise success curves.

Figure 5: Stage-wise cumulative success rates reveal that photorealism in observation streams substantially boosts success in close-proximity tasks.

Re-Rendering Enables Cheap, Operator-Free Appearance Adaptation

Motion–appearance decoupling allows the entire training corpus to be re-rendered under new objects, backgrounds, and prompts in $\sim$ 0.1 hour of compute per condition, a >15-fold reduction in labor compared to teleop. In appearance generalization experiments, only policies trained on appropriately re-rendered (augmented) data maintain high TSR on compositional object-and-scene shifts; all other baselines collapse.

Figure 6: Real-robot transfer success rates vs. appearance variations—LEGS with augmentation remains robust across all conditions, while Teleop and mesh-only baselines fail out-of-distribution.

Limitations

LEGS’ color calibration relies on static camera parameters—deployment in highly dynamic lighting scenarios would require dynamic adaptation. Asset quality for backgrounds is sensitive to the quality of 3DGS scan inputs and static scene assumptions; transient or dynamic objects are not addressed. All evaluations are performed on a single hardware/software stack, and transfer to unseen robot morphologies remains unexamined.

Implications and Future Directions

LEGS fundamentally redefines the data-efficiency frontier for VLA fine-tuning in embodied humanoid loco-manipulation, obviating teleop labor and enabling broad, low-cost domain coverage. The separation of motion and appearance, with compositional re-simulation at scale, offers a template for highly adaptive, robust VLA deployment pipelines. Future extensions should target dynamic/interactive backgrounds, cross-robot and cross-camera generalization, and incorporation of procedural scene generation for more heterogeneous object classes and semantic tasks.

Conclusion

LEGS establishes a rigorous, photorealistic synthetic data pipeline for fine-tuning VLA policies that achieves state-of-the-art sim-to-real transfer in humanoid loco-manipulation—all without teleoperation. By explicitly separating and efficiently recombining physical and visual factors, it enables scalable demonstration generation, compelling sim-to-real transfer performance, and rapid adaptation to novel real-world environments and tasks (2606.01458).

Markdown Report Issue