
InternData Suite: Robotic & Sensor Fusion

Updated 6 February 2026
  • InternData Suite is a comprehensive infrastructure featuring InternData-A1 and Suite-IN for advanced robotics, sensor fusion, and sim-to-real transfer.
  • InternData-A1 employs an autonomous compositional pipeline with domain randomization to generate diverse synthetic trajectories and high-fidelity simulation data.
  • Suite-IN integrates multi-device inertial sensor data using deep fusion techniques to robustly predict pedestrian trajectories and reduce navigation errors.

InternData Suite refers to a family of large-scale datasets and frameworks designed to advance research in robotics and sensor-based navigation using high-fidelity synthetic or real multimodal data. Principal components of the InternData Suite are InternData-A1, an open synthetic dataset for Vision–Language–Action (VLA) model pre-training in robotic manipulation, and Suite-IN, a deep fusion framework for multi-device inertial navigation using commodity wearable devices. Both resources exemplify highly autonomous, modular pipelines and strong empirical generalization, targeting embodied intelligence and robust sensor fusion.

1. Components and Scope of InternData Suite

InternData Suite encompasses two distinct but thematically related systems:

  • InternData-A1: A large-scale, open-source synthetic dataset supporting VLA-based generalist robotic policies. InternData-A1 contains 637,498 trajectories, 401.4 million frames, and 7,433.9 hours of simulated data across 4 robot embodiments, 18 atomic skills, 70 unique tasks, and 227 indoor scenes, involving rigid, articulated, deformable, and fluid-object manipulation (Tian et al., 20 Nov 2025).
  • Suite-IN (also referenced as "InternData Suite" in (Sun et al., 2024)): A multi-device deep learning framework for inertial navigation, leveraging raw IMU time series from diverse wearable devices (e.g., iPhone, Watch, AirPods). Suite-IN fuses global and local motion cues to robustly estimate pedestrian trajectories, disentangling device-level noise from true body motion.

These resources position InternData Suite as critical infrastructure for scalable, reproducible research in embodied AI, sim-to-real robotics transfer, and sensor-based localization.

2. Dataset Construction and Simulation Pipeline in InternData-A1

InternData-A1 is constructed via a fully autonomous, four-stage compositional pipeline:

A. Environment Construction:

  • Assembles each episode by sampling one robot embodiment, one room asset, and relevant task objects with rich physical metadata.
  • Objects include 3,185 rigid models, 321 articulated objects, 20 deformables (garments), and fluids simulated via particles and isosurface rendering.
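
As a concrete illustration of this assembly step, the following is a minimal sketch of per-episode sampling; the asset-catalog layout, names, and helper structure are hypothetical and not the released pipeline.

```python
import random
from dataclasses import dataclass, field

# Hypothetical catalog sizes, taken from the counts reported above.
EMBODIMENTS = [f"robot_{k}" for k in range(4)]          # 4 robot embodiments
ROOMS = [f"room_{k}" for k in range(227)]               # 227 indoor scenes
ASSET_COUNTS = {"rigid": 3185, "articulated": 321, "deformable": 20}

@dataclass
class EpisodeSpec:
    """One sampled episode: a robot embodiment, a room, and its task objects."""
    embodiment: str
    room: str
    objects: list = field(default_factory=list)

def sample_episode(n_task_objects: int = 3) -> EpisodeSpec:
    """Assemble an episode by sampling one embodiment, one room, and task objects."""
    objects = []
    for _ in range(n_task_objects):
        category = random.choice(list(ASSET_COUNTS))
        objects.append({
            "category": category,
            "asset_id": random.randrange(ASSET_COUNTS[category]),
            # Physical metadata (mass, friction, scale, ...) would be attached here.
        })
    return EpisodeSpec(embodiment=random.choice(EMBODIMENTS),
                       room=random.choice(ROOMS),
                       objects=objects)
```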

B. Task Composition via Atomic Skills:

  • Uses a library of 18 atomic skills (e.g., pick, place, push, pour, fold, handover, articulation primitives).
  • Each skill is a scripted policy producing waypoints in 6D end-effector space, enabling sequencing for complex, long-horizon tasks.
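
Conceptually, each atomic skill can be read as a function that emits 6-DoF end-effector waypoints, and long-horizon tasks are built by concatenating skill outputs. The sketch below illustrates that reading; the specific skill implementations and waypoint format are assumptions, not the released skill library.

```python
from typing import List, Tuple

# A waypoint: (x, y, z, roll, pitch, yaw) in 6D end-effector space,
# plus a gripper command (0 = open, 1 = closed).
Waypoint = Tuple[float, float, float, float, float, float, int]

def pick(obj_xyz: Tuple[float, float, float]) -> List[Waypoint]:
    """Scripted 'pick' skill: hover above the object, descend, grasp, lift."""
    x, y, z = obj_xyz
    return [
        (x, y, z + 0.10, 0.0, 3.14, 0.0, 0),  # pre-grasp hover
        (x, y, z + 0.01, 0.0, 3.14, 0.0, 0),  # descend
        (x, y, z + 0.01, 0.0, 3.14, 0.0, 1),  # close gripper
        (x, y, z + 0.15, 0.0, 3.14, 0.0, 1),  # lift
    ]

def place(target_xyz: Tuple[float, float, float]) -> List[Waypoint]:
    """Scripted 'place' skill: move above the target, lower, release."""
    x, y, z = target_xyz
    return [
        (x, y, z + 0.15, 0.0, 3.14, 0.0, 1),
        (x, y, z + 0.02, 0.0, 3.14, 0.0, 1),
        (x, y, z + 0.02, 0.0, 3.14, 0.0, 0),  # open gripper
    ]

def chain_skills(steps: List[List[Waypoint]]) -> List[Waypoint]:
    """Sequence atomic skills into one long-horizon waypoint trajectory."""
    return [wp for skill in steps for wp in skill]

# Example: a two-skill pick-and-place task.
trajectory = chain_skills([pick((0.4, 0.0, 0.02)), place((0.2, 0.3, 0.02))])
```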

C. Domain Randomization:

  • Randomizes camera pose (±5 cm, ±5°), lighting (over 174 environment maps), object instances within categories, grasp poses (AnyGrasp top-40 candidates), and other scene parameters for every sample.
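
The ranges quoted above (±5 cm / ±5° camera jitter, 174 environment maps, top-40 AnyGrasp candidates) suggest per-sample parameter draws roughly like the following sketch; the parameter names and dictionary layout are illustrative assumptions.

```python
import random

N_ENV_MAPS = 174          # lighting environment maps reported above
N_GRASP_CANDIDATES = 40   # AnyGrasp top-k candidates reported above

def randomize_scene(base_cam_xyz, base_cam_rpy_deg):
    """Draw one set of domain-randomization parameters for a single sample."""
    return {
        "camera_position": [c + random.uniform(-0.05, 0.05) for c in base_cam_xyz],    # +/- 5 cm
        "camera_rpy_deg":  [r + random.uniform(-5.0, 5.0) for r in base_cam_rpy_deg],  # +/- 5 deg
        "env_map_id": random.randrange(N_ENV_MAPS),                                    # lighting
        "grasp_candidate": random.randrange(N_GRASP_CANDIDATES),
        # Object instances within each category would be re-sampled here as well.
    }

params = randomize_scene([0.0, -0.6, 0.5], [30.0, 0.0, 90.0])
```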

D. Trajectory Generation and Rendering:

  • Employs a CuRobo-based planner with minimum-jerk, joint-space motion interpolation.
  • Decoupled planning and rendering facilitate multi-stage cluster optimization, enabling a throughput of 209.7 hours of data per day on 8×RTX 4090 at <$0.003 per episode.
  • The modular pipeline and decoupling of stages support flexible skill chaining, scalable task instantiation, and minimal manual intervention.
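
Minimum-jerk interpolation between joint-space waypoints is commonly realized with the quintic time-scaling profile $s(\tau) = 10\tau^3 - 15\tau^4 + 6\tau^5$, which has zero velocity and acceleration at both endpoints. The sketch below illustrates that standard profile; it is not the CuRobo planner itself.

```python
import numpy as np

def minimum_jerk_segment(q_start, q_end, duration, dt=0.01):
    """Interpolate between two joint configurations with a minimum-jerk profile.

    q_start, q_end: arrays of joint angles; duration: segment time in seconds.
    Returns an array of shape (steps, n_joints).
    """
    q_start, q_end = np.asarray(q_start, float), np.asarray(q_end, float)
    steps = int(round(duration / dt)) + 1
    tau = np.linspace(0.0, 1.0, steps)              # normalized time in [0, 1]
    s = 10 * tau**3 - 15 * tau**4 + 6 * tau**5      # quintic minimum-jerk scaling
    return q_start + s[:, None] * (q_end - q_start)

# Example: move a 7-DoF arm between two joint waypoints over 2 seconds.
traj = minimum_jerk_segment(np.zeros(7), np.linspace(0.1, 0.7, 7), duration=2.0)
```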

3. Multi-Device Sensor Fusion and Feature Extraction in Suite-IN

Suite-IN aggregates IMU streams across platforms, performing:

  • Input Preprocessing: Time synchronization (Bluetooth, downsampled to 25 Hz), overlapping window segmentation (window length $L$), and body-frame coordinate alignment using device-specific extrinsic rotations $R_{i\rightarrow\mathrm{body}}$.
  • Feature Extraction:
    • Shallow-feature MLPs preprocess the per-device data.
    • Each device has parallel deep branches: a shared-motion extractor producing a global embedding $H^i$ and a private-motion extractor producing a local embedding $H^i_{\mathrm{pr}}$.
  • Fusion and Regression:
    • Shared embeddings are fused by arithmetic mean, $H^0 = \frac{1}{J}\sum_i H^i$, to form a global motion representation.
    • Velocity is regressed for each $H^i$ and for $H^0$, and trajectories are generated by integrating velocity over time.
    • A contrastive module disentangles shared and private representations via an InfoNCE loss and orthogonality constraints.
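
A minimal PyTorch-style sketch of this shared/private split and mean fusion is given below; the MLP layer sizes, module layout, and 2-D velocity head are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class SuiteINSketch(nn.Module):
    """Illustrative shared/private feature split over J devices with mean fusion."""

    def __init__(self, n_devices: int, in_dim: int = 6 * 200, feat_dim: int = 128):
        super().__init__()
        self.shallow = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU()) for _ in range(n_devices)])
        self.shared_extractor = nn.ModuleList(
            [nn.Linear(256, feat_dim) for _ in range(n_devices)])
        self.private_extractor = nn.ModuleList(
            [nn.Linear(256, feat_dim) for _ in range(n_devices)])
        self.vel_head = nn.Linear(feat_dim, 2)  # regress planar velocity per window

    def forward(self, imu_windows):
        # imu_windows: list of J tensors, each (batch, in_dim), body-frame aligned.
        shared, private = [], []
        for i, x in enumerate(imu_windows):
            f = self.shallow[i](x)
            shared.append(self.shared_extractor[i](f))    # H^i   (global branch)
            private.append(self.private_extractor[i](f))  # H^i_pr (local branch)
        h0 = torch.stack(shared).mean(dim=0)              # H^0 = (1/J) sum_i H^i
        v_global = self.vel_head(h0)
        v_per_device = [self.vel_head(h) for h in shared]
        return v_global, v_per_device, shared, private
```

Trajectories would then be reconstructed by integrating the predicted window velocities over time.
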
4. Training Methodology, Objective Functions, and Implementation

InternData-A1 VLA Model Pre-Training

  • Model: a Vision–Language encoder (Paligemma, 3B parameters) produces a joint image–text embedding $h$; an action expert $f_\theta(h, t)$ is trained with a diffusion-style flow-matching loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0\sim q_0,\;\epsilon \sim \mathcal{N}(0,I),\;t\sim U[0,1]}\left\|f_\theta(x_t, t) - v_\epsilon(x_t, t)\right\|^2$$

with $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and $v_\epsilon = (x_t - x_0)/\sigma_t$.

  • Pre-training matches the official $\pi_0$ schedule: batch size 512, learning rate $5\times 10^{-5}$, 680k steps.
  • Fine-tuning: batch size 128, cosine-decay learning rate $2.5\times 10^{-5}$, 30k–100k steps depending on the task.
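
The flow-matching objective can be written compactly; the sketch below follows the formula quoted above, with a toy schedule $\alpha_t = 1 - t$, $\sigma_t = t$ assumed for concreteness (the actual $\pi_0$ schedule is not specified here), a stand-in network in place of the action expert, and conditioning on the vision–language embedding $h$ omitted for brevity.

```python
import torch
import torch.nn as nn

def flow_matching_loss(f_theta: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    """Diffusion-style flow-matching loss, following the quoted formula.

    f_theta is trained to predict v_eps = (x_t - x_0) / sigma_t, where
    x_t = alpha_t * x_0 + sigma_t * eps.  The schedule alpha_t = 1 - t,
    sigma_t = t is an illustrative assumption.
    """
    b = x0.shape[0]
    t = torch.rand(b, 1)                        # t ~ U[0, 1]
    eps = torch.randn_like(x0)                  # eps ~ N(0, I)
    alpha_t, sigma_t = 1.0 - t, t.clamp_min(1e-4)
    x_t = alpha_t * x0 + sigma_t * eps
    v_target = (x_t - x0) / sigma_t             # equals eps - x0 under this schedule
    v_pred = f_theta(torch.cat([x_t, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

# Toy usage with a stand-in action expert over 32-dimensional action chunks.
net = nn.Sequential(nn.Linear(33, 128), nn.ReLU(), nn.Linear(128, 32))
loss = flow_matching_loss(net, torch.randn(8, 32))
```
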
Suite-IN Sensor Fusion Model

  • The joint loss combines velocity regression (MSE over the global and per-device velocities), an InfoNCE contrastive loss, and an orthogonality penalty between the private and shared spaces:

$$\mathcal{L} = \lambda_v\mathcal{L}_{\mathrm{vel}} + \lambda_c\mathcal{L}_{\mathrm{con}} + \lambda_o\mathcal{L}_{\mathrm{orth}}$$

with typical weights $\lambda_v = 1$, $\lambda_c = 0.2$, and $\lambda_o = 0.05$.

  • Training data: 126 sequences (about 6.5 h and 18 km of walking) across three subjects and multiple environments; window length $L = 200$ (8 s at 25 Hz). Regularization: dropout (0.3 in the MLPs) and batch normalization in the convolutional layers.
  • Optimizer: Adam with initial learning rate $10^{-3}$, decayed at epochs 40 and 60 over 60–80 total epochs.
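
One plausible way to assemble this joint objective is sketched below; the exact InfoNCE pairing scheme and orthogonality formulation used by Suite-IN are not detailed in this summary, so both are stated as assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(shared_feats, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE across devices: the same window seen by two devices is a positive
    pair; other windows in the batch act as negatives (an assumed pairing)."""
    a = F.normalize(shared_feats[0], dim=-1)
    b = F.normalize(shared_feats[1], dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) similarities
    labels = torch.arange(a.shape[0], device=a.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)

def orthogonality_penalty(shared_feats, private_feats) -> torch.Tensor:
    """Penalize overlap between shared and private embeddings (squared cosine)."""
    pen = 0.0
    for h_sh, h_pr in zip(shared_feats, private_feats):
        pen = pen + (F.cosine_similarity(h_sh, h_pr, dim=-1) ** 2).mean()
    return pen / len(shared_feats)

def joint_loss(v_global, v_per_device, v_gt, shared_feats, private_feats,
               lam_v=1.0, lam_c=0.2, lam_o=0.05) -> torch.Tensor:
    """L = lam_v * L_vel + lam_c * L_con + lam_o * L_orth (weights as reported above)."""
    l_vel = F.mse_loss(v_global, v_gt) + sum(
        F.mse_loss(v, v_gt) for v in v_per_device) / len(v_per_device)
    return lam_v * l_vel + lam_c * info_nce(shared_feats) \
        + lam_o * orthogonality_penalty(shared_feats, private_feats)
```
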
5. Empirical Performance and Benchmarking

InternData-A1

  • Simulation tasks (RoboTwin 2.0 suite, 49 tasks):

| Method | Easy avg | Hard avg |
|------------------------|----------|----------|
| $\pi_0$ (scratch) | 23.5 % | 2.5 % |
| $\pi_0$ (official) | 55.0 % | 20.0 % |
| $\pi_0$ (InternData-A1) | 60.0 % | 26.5 % |

InternData-A1 pre-training exceeds the official $\pi_0$ baseline by +5.0 and +6.5 percentage points on Easy and Hard, respectively.

  • Real-world tasks (5 regular + 4 dexterous, 30 rollouts each): $\pi_0$ (InternData-A1) matches $\pi_0$ (official) within ±2–6 percentage points on regular tasks and ±3 points on dexterous tasks.
  • Zero-shot sim-to-real transfer: four representative tasks (e.g., Sort Rubbish, Wipe Stain) reach 80–85% success with $N_\mathrm{sim} \approx N_\mathrm{real}$ episodes. Several long-horizon tasks reach >50% success using synthetic post-training only, establishing robust sim-to-real capability for complex manipulation at a sim:real sample ratio $\lesssim$ 8:1.

Suite-IN

  • Trajectory error metrics (on the ~6.5 h, 18 km dataset):

| Method | ATE (m) | RTE (m) |
|-------------------|---------|---------|
| PDR (classic) | 7.5 | 8.9 |
| IONet (phone) | 13.5 | 11.3 |
| RoNIN (phone) | 8.1 | 7.9 |
| DeepIT (2-device) | 11.8 | 14.3 |
| ReWF1 (3-device) | 5.9 | 7.7 |
| Suite-IN | **3.40** | **4.66** |

Suite-IN achieves a mean position error of approximately 3.40 m, compared to approximately 8.10 m for single-device RoNIN.

  • Robustness: Handles random removal of any device (Watch or AirPods) without catastrophic error spikes; generalizes to varied placement and non-worn use cases; achieves up to 60% drift reduction in device-removal scenarios.
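
For reference, ATE and RTE are standard trajectory-error metrics. A common convention (position RMSE over the full trajectory for ATE, displacement error over fixed-length windows for RTE) is sketched below; the exact alignment and windowing used in the Suite-IN evaluation are not specified here, so treat these details as assumptions.

```python
import numpy as np

def ate(pred: np.ndarray, gt: np.ndarray) -> float:
    """Absolute Trajectory Error: position RMSE over the whole trajectory.
    pred, gt: (T, 2) arrays of planar positions, assumed already aligned."""
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1))))

def rte(pred: np.ndarray, gt: np.ndarray, window: int) -> float:
    """Relative Trajectory Error: mean displacement error over fixed-length windows."""
    errs = []
    for s in range(0, len(gt) - window, window):
        d_pred = pred[s + window] - pred[s]   # predicted displacement over the window
        d_gt = gt[s + window] - gt[s]         # ground-truth displacement
        errs.append(np.linalg.norm(d_pred - d_gt))
    return float(np.mean(errs)) if errs else 0.0
```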

6. Release, Open-Source Availability, and Future Directions

Both principal components of InternData Suite emphasize reproducibility and extensibility:

  • InternData-A1 is fully open-source (Apache 2.0), with the dataset hosted on HuggingFace and detailed documentation available online. The generation pipeline (CuRobo-based, USD + Hydra [Python] configs) will be released, enabling community-wide reproduction and extension (Tian et al., 20 Nov 2025).
  • Suite-IN is architected for scalable multimodal fusion and is easily extendable to more numerous or different types of IMUs (e.g., shoe or chest sensors); ongoing research targets auto-calibration of device mounting parameters, integration of magnetometer/barometer data, and semi-/self-supervised pre-training on unlabeled inertial data (Sun et al., 2024).

A plausible implication is that InternData Suite will further drive advances in embodied-agent generalization, large-scale sim-to-real transfer, and robust navigation with commodity sensing, especially as both pipelines lower the barrier to scalable dataset creation and flexible deployment.

7. Strengths, Limitations, and Technical Challenges

Strengths

  • Scalability and Diversity: InternData-A1 enables large-scale synthetic data generation across diverse tasks, embodiments, and physical environments with minimal manual curation.
  • Robustness and Flexibility: Suite-IN achieves strong accuracy even under dynamic device topologies, sensor perturbations, and unstructured settings.
  • Contrastive Disentanglement: Suite-IN’s separation of global and local features via a contrastive objective enhances robustness to device-specific noise and local perturbations.

Limitations

  • InternData-A1, despite its diversity, depends on accurate simulation of deformables and fluids, a domain known for its high sim-to-real gap. This suggests that further improvements in physics-based simulation fidelity may enhance real-world transfer.
  • Suite-IN requires explicit calibration of device orientation relative to the body (extrinsic rotations $R_{i\rightarrow\mathrm{body}}$), and Bluetooth-based synchronization may drift over extended runs (>10 minutes) unless periodically re-synchronized. IMU quality may also vary across devices (the AirPods IMU is particularly noisy).

Future Directions

  • InternData Suite's future may include auto-calibration, integration of additional inertial sensors (magnetometer, barometer), and robust semi-/self-supervised learning. Extending InternData-A1’s simulation capacities to external physical agents, object-centric manipulation, or variable-gravity regimes could open new domains of embodied-intelligence research.

InternData Suite establishes a rigorously benchmarked, technically advanced foundation for scalable embodied AI and sensor fusion research, supported by open data, reproducible pipelines, and strong empirical results in generalist robotics and wearable-based navigation (Tian et al., 20 Nov 2025; Sun et al., 2024).
