
Training a fully native VLM entirely from scratch

Develop a training procedure to train a fully native vision-language model such as NEO entirely from scratch without initialization from any pretrained large language model.


Background

NEO is introduced as a native, unified vision-language model that integrates image and text processing within a single architecture, using a pre-Buffer and post-LLM design during training to align pixels and words efficiently.
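To make the setup concrete, below is a minimal, hypothetical sketch of such a unified architecture. All module names, layer counts, and dimensions are illustrative assumptions rather than NEO's actual configuration, and positional encoding and causal masking are omitted for brevity; it only shows the general shape of a single decoder stack that processes image patches and text tokens together, split into early "pre-buffer" blocks and later "post-LLM" blocks.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """A standard pre-norm transformer block (self-attention + MLP)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class NativeVLMSketch(nn.Module):
    """Illustrative unified decoder: patch/text embeddings -> pre-buffer blocks -> post-LLM blocks."""

    def __init__(self, vocab: int = 32000, dim: int = 512, heads: int = 8,
                 patch: int = 16, pre_depth: int = 2, post_depth: int = 4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab, dim)
        self.pre_buffer = nn.ModuleList(Block(dim, heads) for _ in range(pre_depth))
        self.post_llm = nn.ModuleList(Block(dim, heads) for _ in range(post_depth))
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, pixels: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt = self.tok_embed(tokens)                               # (B, N_txt, dim)
        x = torch.cat([img, txt], dim=1)                           # one shared token sequence
        for blk in self.pre_buffer:
            x = blk(x)
        for blk in self.post_llm:
            # Per the limitation quoted below, NEO relies on initializing from an
            # existing LLM; the open question is training this part from scratch too.
            x = blk(x)
        return self.lm_head(x[:, img.size(1):])  # predict text tokens only


model = NativeVLMSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```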

In the limitations section, the authors state that, due to constraints on available text corpora and computational resources, they could not train a fully native model purely from scratch, that is, without initializing from an existing LLM. Overcoming this limitation would validate truly de novo multimodal training within their framework and further reduce reliance on language-initialized models.

References

Constrained by current text corpus and computational resources, we are unable to train a fully native model entirely from scratch without initialization from an existing LLM. This limitation also hinders our ability to mitigate potential biases arising from the dominance of the language modality.

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (2510.14979 - Diao et al., 16 Oct 2025) in Appendix, Subsection "Limitation and Discussion"