Scaling laws for 3D VLM pretraining and downstream robot control

Characterize the scaling laws that relate SPEAR-1’s downstream robot control performance to the quantity and quality of the 3D visual question answering pretraining data used to train SPEAR-VLM.

Background

SPEAR-1 is built on SPEAR-VLM, a 3D-aware vision-language model trained on non-robotic 2D images enriched with 3D annotations. The authors show that this pretraining improves downstream control but note that their experiments were constrained in data scale and diversity.

They explicitly state that the relationship between downstream performance and the quantity and quality of 3D pretraining data is not well understood, leaving open how increasing or curating such data would impact generalization and control accuracy.
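
One way to make the question concrete is to assume, purely for illustration, that downstream task error follows a power law in the number of 3D pretraining samples, error(N) ≈ a · N^(-b), and to estimate a and b from a sweep over pretraining set sizes. The paper does not propose this functional form; the function name `fit_power_law` and all numbers below are hypothetical, and the data points are synthetic placeholders rather than measured results. A minimal sketch of such a fit, using least squares in log-log space:

```python
import math


def fit_power_law(sizes, errors):
    """Fit errors ≈ a * sizes**(-b) by ordinary least squares in log-log space.

    Returns (a, b). The power-law form itself is an assumption.
    """
    if len(sizes) != len(errors) or len(sizes) < 2:
        raise ValueError("need at least two (size, error) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of log(error) vs log(size)
    intercept = y_mean - slope * x_mean    # equals log(a)
    return math.exp(intercept), -slope


# Synthetic, noise-free placeholder data (NOT results from the paper):
# errors are generated from a known power law so the fit can be checked.
true_a, true_b = 2.0, 0.3
sizes = [1e4, 3e4, 1e5, 3e5, 1e6]          # hypothetical pretraining set sizes
errors = [true_a * n ** (-true_b) for n in sizes]

a_hat, b_hat = fit_power_law(sizes, errors)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Under the same assumption, the effect of data quality could be probed by repeating the fit separately for each curation level and comparing the fitted exponents. Whether a single power law describes robot control performance at all is part of what remains open.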

References

While we have showed the benefits of 3D VLM pre-training on downstream robot control tasks, the scaling laws relating the latter to the quantity and quality of 3D pre-training data are still not well understood.

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding (2511.17411 - Nikolov et al., 21 Nov 2025) in Section: Discussion and Limitations