JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Published 22 Apr 2026 in cs.RO | (2604.20100v2)

Abstract: Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.


Authors (62)

Summary

The paper introduces a multi-source, multi-level pretraining framework integrating diverse data (web, egocentric, simulation, real) for enhanced robotic policy learning.
The paper demonstrates state-of-the-art results with up to 90.48% simulation success and significant real-world improvements over baseline models.
The paper employs a unified action space that bridges embodiment gaps, enabling consistent, transferable manipulation across heterogeneous robot platforms.

Motivation and Background

Robotic autonomy in unstructured, open-world environments remains constrained by two primary bottlenecks: limited data diversity and poor cross-embodiment generalization. Current approaches, especially those leveraging vision-language-action (VLA) architectures, have yielded substantial improvements in manipulation policy learning as dataset and model scales increase. However, the lack of heterogeneous, high-diversity robotic interaction data and the challenge of aligning knowledge across disparate robot morphologies impede the development of robust, truly general-purpose robots.

Multi-Source Multi-Level Pretraining Framework

JoyAI-RA introduces a hierarchical, multi-source pretraining regimen tailored for scalable VLA policy learning. The data pipeline aggregates four complementary sources:

Multi-Modal Web Data: Visual instruction-tuning corpora (e.g., VQA, captioning, Cosmos-Reason1, Cambrian-10M) supplying semantically dense supervision for grounding language and perception.
Egocentric Human Manipulation Videos (EgoLive): An in-house dataset surpassing prior work (e.g., EgoDex) in category diversity, annotation granularity, and temporal structure. Human hand trajectories, recovered via hand-pose pipelines, are retargeted to multiple robotic platforms, providing rich hierarchical decomposition of manipulation actions.
Simulation-Generated Trajectories: Large-scale synthetic data from platforms such as InternData-A1 and GenieSim3.0 supports pretraining with scalable action supervision and reduces the simulation-to-reality gap via domain randomization.
Real-Robot Data: Aggregated from open-source datasets (e.g., Open-X-Embodiment, AgiBot-World) and the in-house JDAgibot dataset, capturing real interaction details including sensor noise and actuation uncertainties.

This structured data integration approach is explicitly designed to overcome the long-tail and embodiment generalization challenges by providing dense semantic grounding (from web and human data), scalable action diversity (from simulation), and deployment-aligned policy refinement (from real hardware data).

Unified Action Space for Cross-Embodiment Generalization

A crucial architectural advancement in JoyAI-RA is the definition of a unified action space. By encoding all end-effector actions and proprioceptive signals in a camera-relative frame with a fixed-length vector, the model achieves physically and semantically consistent action representations across diverse robot morphologies (from bimanual hands to single-arm grippers). Non-existent DOFs are masked per embodiment, allowing the architecture to learn from multi-morphology data without sacrificing action semantic coherence. This facilitates direct behavior policy transfer between human, simulated, and real robot domains, effectively bridging embodiment gaps.
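
To make the masking idea concrete, here is a minimal sketch of a fixed-length action vector with per-embodiment masks. The slot names, slot sizes, the 40-dimensional layout, and the helper names (`to_unified`, `masked_mse`, `position_in_camera_frame`) are all illustrative assumptions; the source does not specify the actual layout or how the camera-relative transform is computed.

```python
import numpy as np

# Hypothetical unified layout: slot name -> (start, end) in a fixed-length vector.
# Names and sizes are assumptions for illustration, not the paper's layout.
UNIFIED_SLOTS = {
    "left_ee_pose": (0, 7),      # xyz position + quaternion
    "left_gripper": (7, 8),
    "left_fingers": (8, 20),     # dexterous-hand joints
    "right_ee_pose": (20, 27),
    "right_gripper": (27, 28),
    "right_fingers": (28, 40),
}
UNIFIED_DIM = 40

# Which slots each (hypothetical) embodiment actually has.
EMBODIMENT_SLOTS = {
    "single_arm_gripper": ["right_ee_pose", "right_gripper"],
    "bimanual_gripper": ["left_ee_pose", "left_gripper",
                         "right_ee_pose", "right_gripper"],
    "human_hands": ["left_ee_pose", "left_fingers",
                    "right_ee_pose", "right_fingers"],
}


def position_in_camera_frame(p_world, T_world_from_cam):
    """Express a world-frame point in the camera frame (rigid transform assumed)."""
    p_h = np.append(np.asarray(p_world, dtype=np.float64), 1.0)
    return (np.linalg.inv(T_world_from_cam) @ p_h)[:3]


def to_unified(embodiment, parts):
    """Pack per-slot arrays into the fixed-length vector and build the DOF mask."""
    action = np.zeros(UNIFIED_DIM, dtype=np.float32)
    mask = np.zeros(UNIFIED_DIM, dtype=bool)
    for name in EMBODIMENT_SLOTS[embodiment]:
        start, end = UNIFIED_SLOTS[name]
        action[start:end] = parts[name]
        mask[start:end] = True   # DOFs absent from this embodiment stay masked out
    return action, mask


def masked_mse(pred, target, mask):
    """Mean squared error over the DOFs that exist for this embodiment only."""
    diff = (pred - target)[mask]
    return float(np.mean(diff ** 2))
```

With this layout, a single-arm gripper fills only 8 of the 40 entries, and a loss such as `masked_mse` ignores the remaining 32, so samples from different embodiments can share one batch without the missing DOFs contributing gradient.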

Model Architecture and Training Paradigm

JoyAI-RA employs a modular, two-tiered architecture:

Vision-Language Backbone (VLM): Handles multimodal input, spatial reasoning, and token-level task decomposition. Pretraining incorporates VQA, embodied VQA, and instruction following tasks.
Perceiver-Based Action Expert: Decouples semantic/contextual understanding from continuous action generation. The module predicts temporally coherent action sequences via conditional velocity field modeling under a flow-matching framework.
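
As a rough illustration of what conditional velocity field modeling under flow matching can look like, the sketch below uses the common linear interpolation path between Gaussian noise and the target action chunk. The path, the time convention (t=0 is noise, t=1 is data), the Euler sampler, and the `velocity_fn(x, t, context)` interface are assumptions; the source does not state which variant JoyAI-RA uses, and any callable can stand in for the Perceiver-based action expert here.

```python
import numpy as np


def flow_matching_pair(actions, noise, t):
    """Linear path x_t = (1 - t) * noise + t * actions, with target velocity actions - noise.

    actions, noise: arrays of shape (batch, horizon, action_dim); t: shape (batch,).
    """
    t = t.reshape(-1, 1, 1)
    x_t = (1.0 - t) * noise + t * actions
    v_target = actions - noise
    return x_t, v_target


def flow_matching_loss(velocity_fn, actions, context, rng):
    """Regress the predicted velocity onto the target velocity at random times."""
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(0.0, 1.0, size=actions.shape[0])
    x_t, v_target = flow_matching_pair(actions, noise, t)
    v_pred = velocity_fn(x_t, t, context)
    return float(np.mean((v_pred - v_target) ** 2))


def sample_actions(velocity_fn, context, shape, rng, steps=10):
    """Generate an action chunk by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t, context)
    return x
```

Because the whole chunk of shape `(horizon, action_dim)` is denoised jointly, consecutive actions come from one integrated trajectory rather than from independent per-step predictions, which is one plausible reading of "temporally coherent" in the description above.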

Training proceeds in three stages:

VLM Co-Pretraining: Multimodal web and human data for broad semantic/embodied knowledge.
VLA Co-Pretraining: Focuses on continuous action learning from simulation, robot, and retargeted human data in the unified action space.
VLA Post-Training: Task and embodiment-specific domain specialization on real hardware/simulation data.
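
The stage-to-data mapping above can be written down as a small sampling schedule. This is only a sketch: the stage keys, source names, and especially the equal mixture weights are placeholders, since the source gives no mixture ratios, and `sample_source` and `source_counts` are hypothetical helpers.

```python
import random

# Stage -> {data source: sampling weight}, following the three stages listed above.
# All weights are placeholder assumptions; the source does not report mixture ratios.
STAGE_SOURCES = {
    "vlm_co_pretraining": {"web": 1.0, "human": 1.0},
    "vla_co_pretraining": {"simulation": 1.0, "real_robot": 1.0,
                           "retargeted_human": 1.0},
    "vla_post_training": {"target_task": 1.0},
}


def sample_source(stage, rng):
    """Pick the data source for the next training sample in the given stage."""
    sources = STAGE_SOURCES[stage]
    names = list(sources)
    weights = [sources[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def source_counts(stage, num_samples, seed=0):
    """Count how often each source is drawn, to sanity-check a mixture."""
    rng = random.Random(seed)
    counts = {name: 0 for name in STAGE_SOURCES[stage]}
    for _ in range(num_samples):
        counts[sample_source(stage, rng)] += 1
    return counts
```

For example, `source_counts("vla_co_pretraining", 3000)` should return roughly equal counts for the three Stage 2 sources under the placeholder weights.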

Experimental Evaluations

Simulation Benchmarks

On the RoboTwin 2.0 platform, JoyAI-RA achieves average success rates of 90.48% (Easy) and 89.28% (Hard), outperforming all baselines, including Motus, LingBot-VLA, and π0.5 models. Similar improvements are observed in the RoboCasa GR1.Tabletop environment, setting a new state-of-the-art at 63.2% averaged across long-horizon, compositional manipulation tasks.

Real-World Robotic Manipulation

JoyAI-RA demonstrates robust real-robot deployment on the AgiBot G1 platform. It yields a cross-task average success rate of 0.74, markedly higher than π0.5 (0.62). Gains are strongest on tasks requiring complex semantic grounding and multi-stage behavior (e.g., headphones placement, remedy packaging), with the model particularly excelling in precise target object manipulation and multi-object sequencing.

Ablations: The Contribution of Egocentric Human Data

Ablation experiments confirm that large-scale, richly annotated egocentric human data (EgoLive) is a critical performance driver. Full EgoLive pretraining boosts simulation success rates by over 6% relative to robot-only data. Including in-domain human demonstrations further improves transferability on relevant real-world downstream tasks, especially for spatially complex or semantically intricate operations.

Further analysis shows that semantic coverage and task diversity in human data (EgoLive’s heavier-tailed noun/verb/adjective distributions) correlate with policy generalization, as evidenced by t-SNE coverage in visual-language feature space. The combination of EgoLive and EgoDex yields complementary gains, indicating diversity and long-horizon structure as key contributors to transferable embodied knowledge.

Multi-Stage Training and Simulation Data

VLM co-pretraining and VLA co-pretraining each improve policy performance on their own (from an 81.3% baseline to 87.8% and 87.4%, respectively), and their combination achieves 90.4%. Removing simulation data from Stage 2 leads to notable performance drops, underscoring the importance of simulated trajectories for cross-embodiment robustness and long-tail coverage.
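
The stage ablation numbers can be turned into percentage-point gains with a few lines of arithmetic. The mapping of 87.8 to VLM co-pretraining and 87.4 to VLA co-pretraining assumes the two figures are listed in the same order as the stages are named; the dictionary keys are labels chosen here for illustration.

```python
# Success rates (%) quoted in the ablation paragraph above.
baseline = 81.3
results = {
    "vlm_co_pretraining_only": 87.8,   # assumed order: first figure -> VLM stage
    "vla_co_pretraining_only": 87.4,   # assumed order: second figure -> VLA stage
    "both_stages": 90.4,
}

# Absolute gains over the baseline, in percentage points.
gains = {name: round(rate - baseline, 1) for name, rate in results.items()}

# Sum of the two individual gains, for comparison with the combined gain.
sum_of_individual = round(
    gains["vlm_co_pretraining_only"] + gains["vla_co_pretraining_only"], 1
)

print(gains)
print(sum_of_individual)
```

The individual gains are 6.5 and 6.1 points, which sum to 12.6, while the combined gain is 9.1 points. The two stages therefore overlap in part of what they contribute, yet each still adds something the other does not.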

Theoretical and Practical Implications

JoyAI-RA substantiates that structured multi-source pretraining with explicit action-space unification is essential for embodied generalization. This approach provides scalable, transferable manipulation policies capable of robust open-world generalization, including scenarios with significant distribution shift or unseen morphologies. The explicit decoupling of vision-linguistic understanding from continuous control, coupled with action-space masking, offers a paradigm for scaling embodied foundation models analogous to trends in LLM/VLM research.

Practically, this demonstrates a path toward deploying a single policy backbone across fleets of heterogeneous robots while maintaining high manipulation success in unseen or dynamic real-world environments. The demonstrated value of richly annotated egocentric human data further motivates larger, more diverse real-world data collection aligned to target deployment distributions.

Prospects for Future Research

Ongoing challenges highlighted by the results include low-level control precision under extreme embodiment mismatch and sequential reasoning for long-horizon, multi-step manipulation. While multi-source human-centric pretraining yields considerable transferable priors, limitations remain in scenarios requiring highly sensitive coordination or novel tool use. Prospective directions include expanding semantic and physical diversity of human and simulation data, more adaptive action-space representations, and tighter integration of scene graph and spatial reasoning modules for robotic planning.

Conclusion

JoyAI-RA advances embodied VLA foundation modeling by demonstrating that structured, multi-modal, and multi-stage pretraining with unified action-space alignment can substantially improve both simulated and real-world robotic manipulation. The system achieves state-of-the-art generalization across tasks and embodiments, with empirical evidence supporting both the efficacy of egocentric human video at scale and the necessity of simulation data for robust policy formation. The framework and analyses presented inform future development of generalist robot learning systems capable of practical open-world deployment across diverse robots and scenarios (2604.20100).
