Is trajectory diversity the core driver of effective VLA pre-training?

Establish whether trajectory diversity, arising from multi-skill compositions and articulated manipulation within the InternData-A1 dataset, is the primary factor driving effective pre-training performance of Vision-Language-Action models, as opposed to alternative dataset composition factors such as the prevalence of pick-and-place tasks, base tasks, or long-horizon tasks.

Background

In the Data Component Ablation, the dataset is partitioned into four components: pick-and-place (PnP), articulation (Art), base tasks (Base), and long-horizon tasks (Long). Removing any component reduces downstream success, with larger drops observed when excluding Base, Long, or Art than when excluding PnP. Based on these observations, the authors hypothesize that trajectory diversity, especially from multi-skill compositions and articulated interactions, might be the key factor behind effective pre-training.

However, the paper does not provide a definitive causal analysis of this hypothesis and explicitly defers a rigorous investigation to future work, leaving it as an unresolved question that requires further study.

References

At a higher level, combining these two findings, we hypothesize that trajectory diversity may serve as the core drive of effective pre-training. We leave a rigorous investigation for future research.

InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy (2511.16651 - Tian et al., 20 Nov 2025) in Section 6: Data Analysis, Data Component Ablation