InternData Suite is a research infrastructure comprising InternData-A1 and Suite-IN, targeting robotic manipulation, inertial sensor fusion, and sim-to-real transfer.
InternData-A1 employs an autonomous compositional pipeline with domain randomization to generate diverse synthetic trajectories and high-fidelity simulation data.
Suite-IN integrates multi-device inertial sensor data using deep fusion techniques to robustly predict pedestrian trajectories and reduce navigation errors.
InternData Suite refers to a family of large-scale datasets and frameworks designed to advance research in robotics and sensor-based navigation using high-fidelity synthetic or real multimodal data. Principal components of the InternData Suite are InternData-A1, an open synthetic dataset for Vision–Language–Action (VLA) model pre-training in robotic manipulation, and Suite-IN, a deep fusion framework for multi-device inertial navigation using commodity wearable devices. Both resources exemplify highly autonomous, modular pipelines and strong empirical generalization, targeting embodied intelligence and robust sensor fusion.
1. Components and Scope of InternData Suite
InternData Suite encompasses two distinct but thematically related systems:
InternData-A1: A large-scale, open-source synthetic dataset supporting VLA-based generalist robotic policies. InternData-A1 contains 637,498 trajectories, 401.4 million frames, and 7,433.9 hours of simulated data across 4 robot embodiments, 18 atomic skills, 70 unique tasks, and 227 indoor scenes, covering manipulation of rigid, articulated, deformable, and fluid objects (Tian et al., 20 Nov 2025).
Suite-IN (also referenced as "InternData Suite" in (Sun et al., 2024)): A multi-device deep learning framework for inertial navigation, leveraging raw IMU time series from diverse wearable devices (e.g., iPhone, Watch, AirPods). Suite-IN fuses global and local motion cues to robustly estimate pedestrian trajectories, disentangling device-level noise from true body motion.
These resources position InternData Suite as critical infrastructure for scalable, reproducible research in embodied AI, sim-to-real robotics transfer, and sensor-based localization.
2. Dataset Construction and Simulation Pipeline in InternData-A1
InternData-A1 is constructed via a fully autonomous, four-stage compositional pipeline:
A. Environment Construction:
Assembles each episode by sampling one robot embodiment, one room asset, and relevant task objects with rich physical metadata.
Objects include 3,185 rigid models, 321 articulated objects, 20 deformables (garments), and fluids simulated via particles and isosurface rendering.
B. Task Composition via Atomic Skills:
Uses a library of 18 atomic skills (e.g., pick, place, push, pour, fold, handover, articulation primitives).
Each skill is a scripted policy producing waypoints in 6D end-effector space, enabling sequencing for complex, long-horizon tasks.
C. Domain Randomization:
Randomizes camera pose (±5 cm, ±5°), lighting (over 174 environment maps), object instances within categories, grasp poses (AnyGrasp top-40 candidates), and other scene parameters for every sample.
D. Trajectory Generation and Rendering:
Employs a CuRobo-based planner with minimum-jerk, joint-space motion interpolation.
Decoupled planning and rendering facilitate multi-stage cluster optimization, enabling a throughput of 209.7 hours of data per day on 8×RTX 4090 GPUs at under $0.003 per episode.
The modular pipeline and decoupling of stages support flexible skill chaining, scalable task instantiation, and minimal manual intervention.
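To make the composition loop concrete, the sketch below mirrors stages A–C (asset sampling, skill chaining, domain randomization) together with the quintic minimum-jerk time-scaling commonly used for the interpolation in stage D. All asset names, the skill-chain length, and the `Episode` container are illustrative placeholders; the actual CuRobo/USD/Hydra pipeline is not reproduced here.

```python
import random
from dataclasses import dataclass

@dataclass
class Episode:
    embodiment: str
    room: str
    objects: list
    skill_chain: list          # ordered atomic skills, e.g. ["pick", "place"]
    camera_jitter: tuple       # ((dx, dy, dz) in m, (droll, dpitch, dyaw) in deg)
    env_map: int               # index of the sampled lighting environment map

ATOMIC_SKILLS = ["pick", "place", "push", "pour", "fold", "handover"]   # subset of the 18 skills

def sample_episode(embodiments, rooms, object_pool, num_env_maps=174, rng=random):
    """Stages A-C: sample scene assets, compose a skill chain, randomize the scene."""
    return Episode(
        embodiment=rng.choice(embodiments),
        room=rng.choice(rooms),
        objects=rng.sample(object_pool, k=min(3, len(object_pool))),
        skill_chain=[rng.choice(ATOMIC_SKILLS) for _ in range(rng.randint(1, 4))],
        camera_jitter=(
            tuple(rng.uniform(-0.05, 0.05) for _ in range(3)),   # +/- 5 cm translation
            tuple(rng.uniform(-5.0, 5.0) for _ in range(3)),     # +/- 5 deg rotation
        ),
        env_map=rng.randrange(num_env_maps),
    )

def min_jerk(tau):
    """Quintic minimum-jerk time-scaling s(tau), tau in [0, 1], for smooth
    joint-space interpolation between planner waypoints (stage D)."""
    return 10 * tau**3 - 15 * tau**4 + 6 * tau**5

# Example: one randomized episode specification
ep = sample_episode(["single_arm"], ["kitchen_01"], ["mug", "plate", "towel", "bottle"])
```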
3. Multi-Device Sensor Fusion and Feature Extraction in Suite-IN
Suite-IN aggregates IMU streams across platforms, performing:
Input Preprocessing: Time synchronization over Bluetooth with downsampling to 25 Hz, overlapping window segmentation (window length $L$), and body-frame coordinate alignment using device-specific extrinsic rotations $R_{i\rightarrow\mathrm{body}}$.
Feature Extraction: Shallow feature MLPs preprocess the per-device data. Each device feeds two parallel deep branches: a shared-motion extractor producing a global embedding $H^i$ and a private-motion extractor producing a local embedding $H^i_{\mathrm{pr}}$.
Fusion and Regression: Shared embeddings are fused by arithmetic mean, $H^0 = \frac{1}{J}\sum_i H^i$, to form a global motion representation. Velocity is regressed from each $H^i$ and from $H^0$, and trajectories are generated by integrating velocity over time. A contrastive module disentangles shared and private representations via an InfoNCE loss and orthogonality constraints.
4. Training Methodology, Objective Functions, and Implementation
InternData-A1 VLA Model Pre-Training
Model: a vision–language encoder (PaliGemma, 3B parameters) produces a joint image–text embedding $h$; an action expert $f_\theta$ is trained with a diffusion-style flow-matching loss:
$$\mathcal{L}(\theta) = \mathbb{E}_{x_0\sim q_0,\, \epsilon \sim \mathcal{N}(0,I),\, t\sim U[0,1]}\left\|f_\theta(x_t, t) - v_\epsilon(x_t, t)\right\|^2$$
with $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and $v_\epsilon = (x_t - x_0)/\sigma_t$.
Pre-training follows the official $\pi_0$ schedule: batch size 512, learning rate $5\times 10^{-5}$, 680k steps.
Fine-tuning: batch size 128, cosine-decay learning rate $2.5\times 10^{-5}$, 30k–100k steps depending on the task.
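A minimal sketch of the flow-matching objective above is given below. It assumes the linear schedule $\alpha_t = 1 - t$, $\sigma_t = t$ (not stated in the text), and `action_expert(x_t, t, h)` is a placeholder signature for $f_\theta$ conditioned on the vision–language embedding $h$.

```python
import torch

def flow_matching_loss(action_expert, x0, h):
    """Flow-matching loss as written above, under an assumed linear schedule
    alpha_t = 1 - t, sigma_t = t.  x0: (B, horizon, action_dim) ground-truth actions."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).clamp(min=1e-3)       # t ~ U[0, 1], avoid sigma_t = 0
    eps = torch.randn_like(x0)                                 # eps ~ N(0, I)
    shape = (b,) + (1,) * (x0.dim() - 1)
    alpha_t, sigma_t = (1.0 - t).view(shape), t.view(shape)
    x_t = alpha_t * x0 + sigma_t * eps                         # noisy interpolant
    v_target = (x_t - x0) / sigma_t                            # = eps - x0 under the linear schedule
    v_pred = action_expert(x_t, t, h)                          # action expert conditioned on h
    return ((v_pred - v_target) ** 2).mean()
```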
Suite-IN Sensor Fusion Model
The joint loss combines velocity regression (MSE over the global and per-device velocities), an InfoNCE contrastive loss, and an orthogonality penalty between the private and shared spaces:
$$\mathcal{L} = \lambda_v\,\mathcal{L}_{\mathrm{vel}} + \lambda_c\,\mathcal{L}_{\mathrm{con}} + \lambda_o\,\mathcal{L}_{\mathrm{orth}}$$
with typical weights $\lambda_v = 1$, $\lambda_c = 0.2$, and $\lambda_o = 0.05$.
Training data: 126 sequences totaling 6.5 h and 18 km of walking across three subjects and multiple environments; window length $L = 200$ (8 s at 25 Hz). Regularization: dropout (0.3 in the MLPs) and batch normalization in the convolutional layers.
Optimization: Adam with an initial learning rate of $10^{-3}$, decayed at epochs 40 and 60 over 60–80 total epochs.
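The sketch below illustrates the shared/private fusion and the joint loss described above. It is a structural approximation, not the released implementation: the encoder widths, GRU backbones, planar (2-D) velocity head, and InfoNCE temperature are assumptions; only the shared-versus-private branches, the mean fusion $H^0 = \frac{1}{J}\sum_i H^i$, the per-branch velocity regression, and the loss weighting follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuiteINSketch(nn.Module):
    """Structural sketch of the shared/private multi-device fusion (illustrative only)."""

    def __init__(self, num_devices, in_dim=6, feat_dim=64):
        super().__init__()
        self.shallow = nn.ModuleList([nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
                                      for _ in range(num_devices)])
        self.shared_enc = nn.ModuleList([nn.GRU(feat_dim, feat_dim, batch_first=True)
                                         for _ in range(num_devices)])
        self.private_enc = nn.ModuleList([nn.GRU(feat_dim, feat_dim, batch_first=True)
                                          for _ in range(num_devices)])
        self.vel_head = nn.Linear(feat_dim, 2)                 # planar velocity per window

    def forward(self, windows):                                # windows: list of (B, L, 6) tensors
        H_sh, H_pr, v_dev = [], [], []
        for i, x in enumerate(windows):
            f = self.shallow[i](x)
            h_sh = self.shared_enc[i](f)[0][:, -1]             # shared embedding H^i
            h_pr = self.private_enc[i](f)[0][:, -1]            # private embedding H^i_pr
            H_sh.append(h_sh)
            H_pr.append(h_pr)
            v_dev.append(self.vel_head(h_sh))
        H0 = torch.stack(H_sh).mean(dim=0)                     # H^0 = (1/J) sum_i H^i
        return self.vel_head(H0), v_dev, H_sh, H_pr

def info_nce(z1, z2, tau=0.1):
    """Symmetric InfoNCE: shared embeddings of the same window from two devices are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    target = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

def joint_loss(v_global, v_dev, H_sh, H_pr, v_gt, lam_v=1.0, lam_c=0.2, lam_o=0.05):
    """L = lam_v*L_vel + lam_c*L_con + lam_o*L_orth, with the weights quoted above."""
    L_vel = F.mse_loss(v_global, v_gt) + sum(F.mse_loss(v, v_gt) for v in v_dev)
    L_con = sum(info_nce(H_sh[i], H_sh[j])
                for i in range(len(H_sh)) for j in range(i + 1, len(H_sh)))
    L_orth = sum((F.normalize(a, dim=-1) * F.normalize(b, dim=-1)).sum(-1).pow(2).mean()
                 for a, b in zip(H_sh, H_pr))
    return lam_v * L_vel + lam_c * L_con + lam_o * L_orth
```

At inference, velocities predicted from $H^0$ are integrated over successive windows to reconstruct the pedestrian trajectory, matching the integration step described above.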
5. Empirical Performance and Benchmarking
InternData-A1
Simulation tasks (RoboTwin 2.0 suite, 49 tasks):

| Method | Easy avg (%) | Hard avg (%) |
|---|---|---|
| $\pi_0$ (scratch) | 23.5 | – |
| $\pi_0$ (official) | 55.0 | – |
| $\pi_0$ (InternData-A1) | 60.0 | – |

InternData-A1 pre-training exceeds the official $\pi_0$ baseline by +5.0 and +6.5 percentage points on Easy and Hard, respectively.
Real-world tasks (5 regular + 4 dexterous, 30 rollouts each): $\pi_0$ (InternData-A1) matches $\pi_0$ (official) within ±2–6 percentage points on regular tasks and ±3 points on dexterous tasks.
Zero-shot sim-to-real transfer: Four representative tasks (e.g., Sort Rubbish, Wipe Stain) reach 80–85% success with $N_\mathrm{sim} \approx N_\mathrm{real}$ episodes. Several long-horizon tasks exceed 50% success using synthetic post-training only, establishing robust sim-to-real capability for complex manipulation at a sim:real sample ratio $\lesssim 8{:}1$.
Suite-IN
Trajectory error metrics on the 6.5 h, 18 km dataset (ATE: absolute trajectory error; RTE: relative trajectory error):

| Method | ATE (m) | RTE (m) |
|---|---|---|
| PDR (classic) | 7.5 | 8.9 |
| IONet (phone) | 13.5 | 11.3 |
| RoNIN (phone) | 8.1 | 7.9 |
| DeepIT (2-device) | 11.8 | 14.3 |
| ReWF1 (3-device) | 5.9 | 7.7 |
| Suite-IN | **3.40** | **4.66** |

$\mathrm{MPE}_{\mathrm{Suite-IN}} \approx 3.40$ m, compared to $\approx 8.10$ m for single-device RoNIN.
Robustness: Handles random removal of any device (Watch or AirPods) without catastrophic error spikes; generalizes to varied placement and non-worn use-cases; achieves up to 60% drift reduction in device-removal scenarios.
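A minimal sketch of how the mean fusion can tolerate a removed device: average the shared embeddings over whichever devices are currently available. This is illustrative only; the paper's exact inference-time handling of device removal may differ.

```python
import torch

def fuse_shared(H_shared, available):
    """Masked mean over the shared embeddings of the devices that are present,
    mirroring H^0 = (1/J) * sum_i H^i with J reduced to the available subset.

    H_shared:  (J, B, D) tensor of per-device shared embeddings
    available: (J,) boolean mask, False for a removed device (e.g., AirPods taken out)
    """
    mask = available.float().view(-1, 1, 1)                    # broadcast over batch and feature dims
    return (H_shared * mask).sum(dim=0) / mask.sum().clamp(min=1.0)

# Example: phone + watch present, AirPods removed at test time
H0 = fuse_shared(torch.randn(3, 8, 64), torch.tensor([True, True, False]))
```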
6. Release, Open-Source Availability, and Future Directions
Both principal components of InternData Suite emphasize reproducibility and extensibility:
InternData-A1 is fully open-source (Apache 2.0), with the dataset hosted on Hugging Face and detailed documentation available online. The generation pipeline (CuRobo-based, with USD and Hydra [Python] configs) will also be released, enabling community-wide reproduction and extension (Tian et al., 20 Nov 2025).
Suite-IN is architected for scalable multimodal fusion: easily extendable to greater numbers or types of IMUs (e.g., shoe, chest sensors); ongoing research targets auto-calibration of device mounting parameters, integration of magnetometer/barometer data, and semi/self-supervised pre-training on unlabeled inertial data (Sun et al., 2024).
A plausible implication is that InternData Suite will further drive advances in embodied agent generalization, large-scale sim-to-real transfer, and robust navigation using commodity sensing, especially as both pipelines lower the barrier to scalable dataset creation and flexible deployment.
7. Strengths, Limitations, and Technical Challenges
Strengths
Scalability and Diversity: InternData-A1 enables large-scale synthetic data generation across diverse tasks, embodiments, and physical environments with minimal manual curation.
Robustness and Flexibility: Suite-IN achieves strong accuracy even under dynamic device topologies, sensor perturbations, and unstructured settings.
Contrastive Disentanglement: Suite-IN’s separation of global and local features via contrastive objective enhances robustness to device-specific noise and local perturbations.
Limitations
InternData-A1, despite its diversity, depends on accurate simulation of deformables and fluids—a domain known for its high sim-to-real gap. This suggests further improvements in physics-based simulation fidelity may enhance real-world transfer.
Suite-IN requires explicit calibration of device orientation relative to the body (extrinsic rotations $R_{i\rightarrow\mathrm{body}}$), and Bluetooth-based synchronization may drift over extended runs (>10 minutes) unless periodically re-synchronized. IMU quality is also heterogeneous across devices (the AirPods IMU is particularly noisy).
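For concreteness, the body-frame alignment that this calibration enables amounts to a per-sample rotation of the raw device-frame IMU readings. The sketch below assumes $R_{i\rightarrow\mathrm{body}}$ is already known (e.g., from an initial calibration routine, which is an assumption of this sketch).

```python
import numpy as np

def align_to_body(acc_dev, gyr_dev, R_i_to_body):
    """Rotate device-frame accelerometer and gyroscope samples into the body frame
    using the per-device extrinsic rotation R_{i->body}.

    acc_dev, gyr_dev: (N, 3) arrays of raw IMU samples in the device frame
    R_i_to_body:      (3, 3) rotation matrix from device frame i to the body frame
    """
    return acc_dev @ R_i_to_body.T, gyr_dev @ R_i_to_body.T
```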
Future Directions
InternData Suite's future may include auto-calibration of device mounting, integration of additional sensing modalities (magnetometer, barometer), and robust semi/self-supervised learning. Extending InternData-A1's simulation capacities to external physical agents, object-centric manipulation, or variable gravity regimes could open new domains of embodied intelligence research.
InternData Suite establishes a rigorously benchmarked, technically advanced foundation for scalable embodied AI and sensor fusion research, supported by open data, reproducible pipelines, and strong empirical results in generalist robotics and wearable-based navigation (Tian et al., 20 Nov 2025, Sun et al., 2024).