Impact of visual encoder initialization on GEA performance

Determine whether the observation that initializing the LLaVA‑OneVision SigLIP visual encoder affects Generalist Embodied Agent (GEA) performance more strongly than initializing only the large language model arises because the evaluated benchmarks require visual generalization and because training the SigLIP visual encoder from scratch on embodied data alone is challenging.

Background

In the analysis of GEA’s training and model components, the authors compared initialization strategies that differ in which components inherit pretrained weights: only the LLM, only the visual encoder, or the full multimodal model. Across several benchmarks, they observed that visual encoder initialization had a stronger impact on performance than LLM initialization.
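
To make the compared conditions concrete, here is a minimal sketch of how such an initialization ablation could be wired up with PyTorch and HuggingFace Transformers. The checkpoint names, the `reinit_module` helper, and the overall wiring are illustrative assumptions for exposition, not the authors' actual training code.

```python
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM


def reinit_module(module: nn.Module) -> None:
    """Re-draw all parameters using each submodule's default initializer,
    simulating training that component from scratch."""
    for m in module.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()


def build_agent_components(strategy: str):
    # Illustrative pretrained checkpoints (LLaVA-OneVision pairs a SigLIP
    # visual encoder with a Qwen2 LLM; the exact names here are assumptions).
    vision_encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
    llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B")

    if strategy == "llm_only":
        # Pretrained LLM; visual encoder trained from scratch on embodied data.
        reinit_module(vision_encoder)
    elif strategy == "vision_only":
        # Pretrained SigLIP visual encoder; LLM trained from scratch.
        reinit_module(llm)
    elif strategy == "full_mllm":
        # Both components keep their pretrained weights.
        pass
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return vision_encoder, llm
```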

To explain this observation, the authors conjecture that the benchmarks demand strong visual generalization and that training the SigLIP visual encoder from scratch on embodied data alone is difficult. This causal explanation has not been verified, which motivates the question above.

References

“We conjecture that this is because the benchmarks require visual generalization, and training the LLaVA-OneVision SigLIP visual encoder from scratch with only the embodied data is challenging.”

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons (2412.08442 - Szot et al., 11 Dec 2024) in Section 5.2 (GEA Training and Model Analysis: Impact of pretrained MLLM)