Effect of Scaling Video Foundation Model Training Data on MV-VDP Generalization

Determine whether pretraining the Wan2.2 video foundation model that serves as MV-VDP's backbone on more data improves MV-VDP's generalization to unseen tasks and visual variations.

Background

In real-world evaluations, MV-VDP underperforms BridgeVLA on certain unseen settings (Put-B and Put-H), which the authors attribute to BridgeVLA’s vision–language backbone being pretrained on larger-scale image–text data and thus capturing more visual variation.

Motivated by this observation, the authors explicitly conjecture that scaling up the training data for the video foundation model used by MV-VDP could similarly enhance generalization, leaving open the question of whether such scaling would close the gap on unseen scenarios.
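
To make the open question concrete, below is a minimal sketch of how such a scaling study could be run: finetune MV-VDP policies from Wan2.2 backbones pretrained on increasing video-data budgets, then compare their success rates on the unseen settings. Everything in the sketch is an assumption for illustration; ScalePoint, sweep, and the rollout callable are hypothetical names rather than code from the paper, and the task list simply echoes the unseen settings (Put-B, Put-H) mentioned above.

```python
# Hypothetical sketch of the proposed scaling study. None of these names come
# from the paper's codebase; the rollout function stands in for a real robot
# or simulator evaluation.
import random
from dataclasses import dataclass
from typing import Callable

UNSEEN_TASKS = ["Put-B", "Put-H"]  # unseen settings highlighted in the paper


@dataclass
class ScalePoint:
    pretrain_hours: float  # video data budget used to pretrain the backbone
    policy: object         # MV-VDP policy finetuned from that backbone


def sweep(points: list[ScalePoint],
          run_episode: Callable[[object, str], bool],
          episodes: int = 20) -> dict[float, dict[str, float]]:
    """Measure success rate on each unseen task at each pretraining scale."""
    results: dict[float, dict[str, float]] = {}
    for p in points:
        results[p.pretrain_hours] = {
            task: sum(run_episode(p.policy, task)
                      for _ in range(episodes)) / episodes
            for task in UNSEEN_TASKS
        }
    return results


if __name__ == "__main__":
    # Stand-in rollout; a real study would execute the policy on hardware
    # or in simulation and return whether the episode succeeded.
    def fake_rollout(policy: object, task: str) -> bool:
        return random.random() < 0.5

    points = [ScalePoint(h, policy=None) for h in (1e3, 1e4, 1e5)]
    for hours, rates in sweep(points, fake_rollout).items():
        print(f"{hours:>8.0f}h pretraining: {rates}")
```

If the conjecture holds, success rates on the unseen settings should rise with the pretraining budget; a flat curve would suggest the gap to BridgeVLA stems from something other than backbone data scale.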

References

"We further conjecture that scaling up the training data for the video foundation model could similarly improve the generalization ability of our approach."

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model (arXiv:2604.03181, Li et al., 3 Apr 2026), in Experiments, Real-World Experiments, Results (Generalization to unseen tasks)