Unified VFM feature space across perception, reconstruction, generation, and understanding
Determine whether a unified Visual Foundation Model (VFM) representation space can simultaneously support visual reconstruction, perception, high-fidelity image generation, and semantic understanding without compromising performance on any of these tasks.
References
However, two key challenges remain unresolved: Can a unified feature space support visual reconstruction, perception, high-fidelity generation, and semantic understanding without compromising performance on any of these tasks?
— SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
(2512.11749 - Shi et al., 12 Dec 2025) in Section 1 (Introduction)