Effectiveness of a Fully Vision-Centric Paradigm on Complex Tasks

Develop training strategies and architectural designs that make a fully vision-centric paradigm—where a single visual model processes images, visualized text, and audio—effective on complex downstream tasks.

Background

The authors advocate moving beyond text-tokenizer-based LLMs toward a vision-centric approach that unifies multiple modalities within a single visual model, citing CLIPPO as an example of processing images and text jointly.

While this paradigm shows promise, the authors explicitly state that making it effective on more complex tasks is an unresolved challenge, and they highlight the need for advances in training strategies and architectural design to realize its potential.

References

Nonetheless, making a fully vision-centric paradigm effective on more complex tasks remains an open challenge, requiring further exploration in training strategies and architectural design to unlock its full potential.

— See the Text: From Tokenization to Visual Reading (2510.18840 - Xing et al., 21 Oct 2025) in Appendix, Section “Promising Direction: The Vision-Centric Paradigm”

Effectiveness of a Fully Vision-Centric Paradigm on Complex Tasks

Background

References

Related Problems