Discriminative capability gap between RecTok and Vision Foundation Models

Determine how to eliminate the gap in discriminative capability between latent representations produced by RecTok, a high-dimensional visual tokenizer trained with Flow Semantic Distillation and Reconstruction–Alignment Distillation, and those of Vision Foundation Models such as DINOv3, SigLIP 2, and SAM.

Background

RecTok is proposed as a high-dimensional visual tokenizer that distills semantics along the forward rectified flow and incorporates masked feature reconstruction to improve both generative performance and semantic consistency. Despite these advances, the authors report a key limitation in semantic discrimination.

Specifically, even with increased latent dimensionality, RecTok’s latent features remain less discriminative than those of strong Vision Foundation Models (VFMs). The authors explicitly flag this limitation as an open question for future work, so the gap remains unresolved.
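
A gap of this kind is often quantified by fitting a linear classifier on frozen features and comparing held-out accuracy. The sketch below is a minimal illustration of that idea, not the evaluation protocol of the RecTok paper, which this summary does not describe. It assumes each image has already been reduced to a single feature vector, uses a closed-form ridge-regression probe rather than a trained logistic-regression head, and substitutes synthetic Gaussian features for real tokenizer or VFM outputs. The names `linear_probe_accuracy` and `synthetic_features` are hypothetical.

```python
import numpy as np


def linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes, l2=1e-3):
    """Fit a ridge-regression probe on frozen features and return test accuracy."""
    # Standardize using training statistics only, so no test information leaks in.
    mean = train_x.mean(axis=0, keepdims=True)
    std = train_x.std(axis=0, keepdims=True) + 1e-8
    xtr = (train_x - mean) / std
    xte = (test_x - mean) / std

    # Append a constant column so the probe can learn a bias term.
    xtr = np.hstack([xtr, np.ones((xtr.shape[0], 1))])
    xte = np.hstack([xte, np.ones((xte.shape[0], 1))])

    # Closed-form ridge regression onto one-hot class targets.
    targets = np.eye(num_classes)[train_y]
    gram = xtr.T @ xtr + l2 * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ targets)

    # Predict the class with the largest regression score.
    preds = (xte @ weights).argmax(axis=1)
    return float((preds == test_y).mean())


def synthetic_features(rng, labels, class_means, separation):
    """Stand-in for real features: scaled class means plus unit Gaussian noise."""
    noise = rng.standard_normal((labels.size, class_means.shape[1]))
    return separation * class_means[labels] + noise


# Illustrative comparison on synthetic data (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
num_classes, dim = 4, 16
class_means = rng.standard_normal((num_classes, dim))
train_y = rng.integers(0, num_classes, size=400)
test_y = rng.integers(0, num_classes, size=200)

for name, separation in [("well separated", 3.0), ("weakly separated", 0.1)]:
    train_x = synthetic_features(rng, train_y, class_means, separation)
    test_x = synthetic_features(rng, test_y, class_means, separation)
    acc = linear_probe_accuracy(train_x, train_y, test_x, test_y, num_classes)
    print(f"{name}: probe accuracy = {acc:.3f}")
```

With real models, the same probe would be fit once on RecTok latents and once on VFM features for the same labeled images, and the difference in held-out accuracy would serve as one possible measure of the gap.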

References

"In terms of semantics, although RecTok enhances semantic structure by increasing the latent dimensionality, its discriminative capability still lags behind that of VFMs. We leave these challenges as open questions for future work."

RecTok: Reconstruction Distillation along Rectified Flow (2512.13421 - Shi et al., 15 Dec 2025) in Supplementary, Section: Limitations