Discriminative capability gap between RecTok and Vision Foundation Models
Determine how to close the gap in discriminative capability between the latent representations produced by RecTok (a high-dimensional visual tokenizer trained with Flow Semantic Distillation and Reconstruction–Alignment Distillation) and those of Vision Foundation Models (VFMs) such as DINOv3, SigLIP 2, and SAM.
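As an illustration of how such a gap might be quantified, the sketch below fits a linear probe on frozen features from two encoders and reports the difference in held-out accuracy. This is an assumption about the measurement protocol, not the procedure of the RecTok paper: the function names `linear_probe_accuracy` and `probe_gap` are hypothetical, the probe is a closed-form ridge classifier chosen for brevity, and the features are assumed to be already pooled to one vector per image.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized least-squares linear classifier on frozen features
    and return its accuracy on the held-out split."""
    # Standardize with training statistics only, to avoid leaking test data.
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0) + 1e-8
    train_x = (train_x - mean) / std
    test_x = (test_x - mean) / std
    # Append a constant column so the classifier has a bias term.
    train_x = np.hstack([train_x, np.ones((len(train_x), 1))])
    test_x = np.hstack([test_x, np.ones((len(test_x), 1))])
    # One-hot regression targets, solved in closed form.
    targets = np.eye(num_classes)[train_y]
    gram = train_x.T @ train_x + ridge * np.eye(train_x.shape[1])
    weights = np.linalg.solve(gram, train_x.T @ targets)
    preds = (test_x @ weights).argmax(axis=1)
    return float((preds == test_y).mean())


def probe_gap(tok_feats, vfm_feats, labels, num_classes, train_frac=0.8, seed=0):
    """Return (tokenizer accuracy, VFM accuracy, VFM minus tokenizer accuracy)
    using the same random train/test split for both feature sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    cut = int(train_frac * len(labels))
    tr, te = order[:cut], order[cut:]
    acc_tok = linear_probe_accuracy(
        tok_feats[tr], labels[tr], tok_feats[te], labels[te], num_classes
    )
    acc_vfm = linear_probe_accuracy(
        vfm_feats[tr], labels[tr], vfm_feats[te], labels[te], num_classes
    )
    return acc_tok, acc_vfm, acc_vfm - acc_tok


def synthetic_demo(num_samples=400, num_classes=4, dim=16, seed=0):
    """Synthetic stand-in data (not real model outputs): the 'VFM' features are
    class-separated clusters, the 'tokenizer' features are pure noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=num_samples)
    class_means = 5.0 * rng.normal(size=(num_classes, dim))
    vfm_feats = class_means[labels] + rng.normal(size=(num_samples, dim))
    tok_feats = rng.normal(size=(num_samples, dim))
    return probe_gap(tok_feats, vfm_feats, labels, num_classes)
```

In practice, `tok_feats` and `vfm_feats` would be replaced by pooled features extracted from the tokenizer and from a VFM on the same labeled images. A positive third return value indicates that the VFM features are more linearly separable under this particular probe.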
References
In terms of semantics, although RecTok enhances semantic structure by increasing the latent dimensionality, its discriminative capability still lags behind that of VFMs. We leave these challenges as open questions for future work.
— RecTok: Reconstruction Distillation along Rectified Flow
(2512.13421 - Shi et al., 15 Dec 2025) in Supplementary, Section: Limitations