Training a fully native VLM entirely from scratch
Develop a training procedure for a fully native vision-language model such as NEO that is trained entirely from scratch, without initialization from any pretrained large language model.
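To make the target concrete, the sketch below illustrates, in hypothetical PyTorch, what "from scratch" means here: a single unified transformer whose parameters, including every language layer, start from random initialization and are trained with next-token prediction over a shared sequence of image patches and text tokens. The class name `NativeVLM`, the layer sizes, the patch size, and the toy data are illustrative assumptions only, not NEO's actual architecture or training recipe.

```python
# Minimal, hypothetical sketch of a from-scratch native VLM training step.
# No pretrained vision encoder or LLM is loaded anywhere; all weights are
# randomly initialized. Dimensions and data are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NativeVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=8, n_heads=8,
                 patch_size=16, img_channels=3, max_len=1024):
        super().__init__()
        # Pixel patches are projected directly into the shared embedding space.
        self.patch_embed = nn.Linear(img_channels * patch_size * patch_size, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.patch_size = patch_size

    def patchify(self, images):
        # (B, C, H, W) -> (B, num_patches, C * P * P)
        B, C, H, W = images.shape
        P = self.patch_size
        patches = images.unfold(2, P, P).unfold(3, P, P)        # (B, C, H/P, W/P, P, P)
        return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)

    def forward(self, images, text_ids):
        img_tokens = self.patch_embed(self.patchify(images))    # (B, N_img, D)
        txt_tokens = self.tok_embed(text_ids)                   # (B, N_txt, D)
        x = torch.cat([img_tokens, txt_tokens], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embed(pos)
        # Causal mask: each position attends only to earlier positions.
        L = x.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        # Return logits only over the text positions.
        return self.lm_head(h[:, img_tokens.size(1):, :])

# Toy training step on random data, standing in for large-scale interleaved
# image-text corpora.
model = NativeVLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
images = torch.randn(2, 3, 224, 224)
text_ids = torch.randint(0, 32000, (2, 64))
logits = model(images, text_ids[:, :-1])                        # teacher forcing
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       text_ids[:, 1:].reshape(-1))
loss.backward()
opt.step()
```

The open question, as the quoted limitation below notes, is not the architecture itself but how to scale such a randomly initialized model with the available text and multimodal corpora while keeping the language and vision modalities in balance.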
References
Constrained by current text corpus and computational resources, we are unable to train a fully native model entirely from scratch without initialization from an existing LLM. This limitation also hinders our ability to mitigate potential biases arising from the dominance of the language modality.
— From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
(arXiv:2510.14979, Diao et al., 16 Oct 2025), Appendix, Subsection "Limitation and Discussion"