Effectiveness of a Fully Vision-Centric Paradigm on Complex Tasks
Develop training strategies and architectural designs that make a fully vision-centric paradigm—where a single visual model processes images, visualized text, and audio—effective on complex downstream tasks.
References
Nonetheless, making a fully vision-centric paradigm effective on more complex tasks remains an open challenge, requiring further exploration in training strategies and architectural design to unlock its full potential.
— See the Text: From Tokenization to Visual Reading
(arXiv:2510.18840, Xing et al., 21 Oct 2025), Appendix, Section “Promising Direction: The Vision-Centric Paradigm”