Impact of visual encoder initialization on GEA performance
Determine why initializing the LLaVA-OneVision SigLIP visual encoder affects Generalist Embodied Agent (GEA) performance more than initializing only the large language model. Specifically, test whether the gap arises because (1) the evaluated benchmarks require visual generalization and (2) training the SigLIP visual encoder from scratch on embodied data alone is challenging.
References
We conjecture that this is because the benchmarks require visual generalization, and training the LLaVA-OneVision SigLIP visual encoder from scratch with only the embodied data is challenging.
— From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons
(2412.08442 - Szot et al., 11 Dec 2024) in Section 5.2 (GEA Training and Model Analysis: Impact of pretrained MLLM)