Jointly scaling human data and model capacity for improved planning and composition

Determine whether jointly scaling the amount of egocentric human pretraining data and the capacity of the flow-based Vision–Language–Action policy introduced in EgoScale yields further gains in dexterous robot manipulation, and in particular whether it improves long-horizon planning and compositional generalization beyond what scaling data alone achieves.

Background

EgoScale pretrains a flow-based Vision–Language–Action (VLA) policy on 20,854 hours of egocentric human video and demonstrates a log-linear scaling law between human action prediction loss and data scale, with strong correlation to downstream real-robot performance. Within the explored regime, increasing human data yields predictable improvements without saturation.
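A log-linear scaling law of this kind can be fit in a few lines. The sketch below uses invented (hours, loss) pairs purely to illustrate the fitting and extrapolation machinery; the numbers are not from EgoScale:

```python
import numpy as np

# Illustrative (made-up) pairs of pretraining hours and action-prediction loss.
hours = np.array([100.0, 500.0, 2000.0, 8000.0, 20854.0])
loss = np.array([0.92, 0.81, 0.71, 0.62, 0.55])

# Fit the log-linear law L(D) = a + b * log10(D); b < 0 means more data helps.
b, a = np.polyfit(np.log10(hours), loss, 1)

def predict_loss(h):
    """Extrapolate the fitted law to a new data scale (hours)."""
    return a + b * np.log10(h)

print(f"slope b = {b:.3f}")  # negative: loss falls linearly in log data
print(f"predicted loss at 50k hours: {predict_loss(5e4):.3f}")
```

"No saturation within the explored regime" corresponds to the fitted line showing no flattening up to the largest data scale, so extrapolations like the one above remain plausible but unverified beyond it.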

In the conclusion, the authors note that future progress may come from not only scaling data but also increasing model capacity. They suggest that jointly scaling human data and model capacity could unlock further gains, specifically in long-horizon planning and compositional generalization, indicating a concrete open direction to empirically verify and characterize.
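One way to empirically characterize such a joint law is to sweep both model capacity and data scale and fit loss against both log factors. The sketch below is a hypothetical illustration: the functional form, coefficients, and sweep grid are all invented, and the grid is noise-free so the fit recovers them exactly:

```python
import numpy as np

# Hypothetical joint scaling law: loss = a + b*log10(params) + c*log10(hours).
# All coefficients are invented for illustration only.
a_true, b_true, c_true = 2.0, -0.12, -0.16

# A small sweep grid over model capacity (params) and data scale (hours).
params = np.array([1e8, 1e8, 1e9, 1e9, 1e10, 1e10])
hours = np.array([1e3, 2e4, 1e3, 2e4, 1e3, 2e4])
loss = a_true + b_true * np.log10(params) + c_true * np.log10(hours)

# Least-squares fit of the joint law from the (params, hours, loss) grid.
X = np.column_stack([np.ones_like(loss), np.log10(params), np.log10(hours)])
coef, *_ = np.linalg.lstsq(X, loss, rcond=None)
print(coef)  # recovers [a_true, b_true, c_true] on this noise-free grid
```

In practice one would also test for an interaction term (e.g. a log(params) * log(hours) cross term) to detect whether capacity changes the data-scaling exponent, which is precisely the kind of effect the authors' suggestion leaves open.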

References

Looking forward, several directions remain open. While we observe no saturation within the explored regime, jointly scaling human data and model capacity may unlock further gains, including improved long-horizon planning and compositional generalization.

EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data  (2602.16710 - Zheng et al., 18 Feb 2026) in Conclusion