Mechanism underlying task-dependent differences between sequence and feature action conditioning
Determine why sequence conditioning (actions encoded as tokens concatenated along the sequence dimension with Rotary Position Embeddings) outperforms feature conditioning (actions concatenated along the embedding dimension) on the DROID, Robocasa, and Metaworld manipulation tasks, whereas feature conditioning outperforms sequence conditioning on the Wall 2D navigation task. The difference persists even when the action-to-visual dimensional ratios are matched in JEPA world model predictors.
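To make the two conditioning schemes concrete, the following is a minimal NumPy sketch of how the predictor input could be assembled in each case. It is an illustration under stated assumptions, not the paper's implementation: the sizes, the linear action encoder `proj`, and the function names are hypothetical, and the Rotary Position Embeddings applied in sequence conditioning are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
num_patches, embed_dim, action_dim = 4, 8, 2

visual = rng.normal(size=(num_patches, embed_dim))  # visual patch embeddings for one frame
action = rng.normal(size=(action_dim,))             # one low-dimensional action


def sequence_condition(visual, action, proj):
    """Append the action as one extra token along the sequence axis."""
    action_token = action @ proj                     # shape (embed_dim,)
    return np.concatenate([visual, action_token[None, :]], axis=0)


def feature_condition(visual, action):
    """Append the raw action to every visual token along the embedding axis."""
    tiled = np.broadcast_to(action, (visual.shape[0], action.shape[0]))
    return np.concatenate([visual, tiled], axis=1)


# Hypothetical linear action encoder; the actual encoder is an assumption here.
proj = rng.normal(size=(action_dim, embed_dim))

seq_in = sequence_condition(visual, action, proj)
feat_in = feature_condition(visual, action)

print(seq_in.shape)   # (5, 8): one more token, same embedding width
print(feat_in.shape)  # (4, 10): same token count, wider embedding
```

In this sketch, sequence conditioning leaves the visual tokens unchanged and lets the predictor reach the action only through attention over an extra token, while feature conditioning places the action inside every token. Which of these properties explains the task-dependent gap is the open question stated above.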
References
We cannot provide a precise explanation of the underlying mechanism explaining why we observe such differences.
— What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
(2512.24497 - Terver et al., 30 Dec 2025) in Appendix, Section 6.1 (Additional experiments), paragraph “Equalized action ratio experiments”