Unified VFM feature space across perception, reconstruction, generation, and understanding

Determine whether a unified Visual Foundation Model (VFM) representation space can simultaneously support visual reconstruction, perception, high-fidelity image generation, and semantic understanding without compromising performance on any of these tasks.

Background

The paper advocates operating text-to-image diffusion directly in the high-dimensional feature space of a Visual Foundation Model (VFM) to unify visual understanding, perception, and generation. Traditional VAE latents lack coherent semantic structure, which motivates exploring VFM-based latents as a shared substrate for multiple tasks.
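
To make the contrast with VAE-latent diffusion concrete, the minimal sketch below performs one denoising training step directly on VFM feature tokens instead of VAE latents. It assumes the VFM features are a set of token embeddings (e.g., patch tokens from a frozen encoder such as DINOv2) and that text conditioning is a pooled embedding; the TokenDenoiser module, shapes, and hyperparameters are illustrative assumptions, not the SVG-T2I architecture.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: in practice, `vfm_features` would come from a frozen
# VFM encoder (e.g., DINOv2 patch tokens) and `text_emb` from a text encoder.
# All shapes and sizes here are arbitrary for this sketch.
B, N, D = 4, 256, 768                  # batch, VFM tokens per image, feature dim
vfm_features = torch.randn(B, N, D)    # diffusion targets: VFM features, not VAE latents
text_emb = torch.randn(B, D)           # pooled text conditioning

class TokenDenoiser(nn.Module):
    """Toy denoiser that predicts the noise added to VFM feature tokens."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim)
        )

    def forward(self, x_t, t, cond):
        # Broadcast the timestep and text condition to every feature token.
        t = t.view(-1, 1, 1).expand(-1, x_t.size(1), 1)
        cond = cond.unsqueeze(1).expand(-1, x_t.size(1), -1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

denoiser = TokenDenoiser(D)
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

# One DDPM-style training step carried out entirely in VFM feature space:
# noise the features, predict the noise, regress against it.
t = torch.rand(B)                                   # continuous timestep in [0, 1]
noise = torch.randn_like(vfm_features)
alpha = (1.0 - t).view(-1, 1, 1)
x_t = alpha.sqrt() * vfm_features + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(denoiser(x_t, t, text_emb), noise)
loss.backward()
opt.step()
```

Sampling in this setup would iterate the denoiser from pure noise down to clean VFM features, which a separate decoder would then map back to pixels; that decoder replaces the role a VAE decoder plays in conventional latent diffusion.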

The authors highlight a core unresolved question: whether a single, principled representation space can jointly preserve low-level perceptual fidelity, accurate reconstruction, and high-level semantic understanding and generation, thereby eliminating the need for fragmented, task-specific encoders.

References

However, two key challenges remain unresolved: Can a unified feature space support visual reconstruction, perception, high-fidelity generation, and semantic understanding without compromising performance on any of these tasks?

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder (2512.11749, Shi et al., 12 Dec 2025), Section 1 (Introduction)