
RecTok: Reconstruction Distillation along Rectified Flow (2512.13421v2)

Published 15 Dec 2025 in cs.CV

Abstract: Visual tokenizers play a crucial role in diffusion models. The dimensionality of latent space governs both reconstruction fidelity and the semantic expressiveness of the latent feature. However, a fundamental trade-off is inherent between dimensionality and generation quality, constraining existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction--alignment distillation. Our key insight is to make the forward flow in flow matching semantically rich, which serves as the training space of diffusion transformers, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information in VFMs into the forward flow trajectories in flow matching. And we further enhance the semantics by introducing a masked feature reconstruction loss. Our RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art results on the gFID-50K under both with and without classifier-free guidance settings, while maintaining a semantically rich latent space structure. Furthermore, as the latent dimensionality increases, we observe consistent improvements. Code and model are available at https://shi-qingyu.github.io/rectok.github.io.

Summary

  • The paper introduces a dual-distillation framework leveraging flow semantic distillation and reconstruction-alignment distillation to enable high-dimensional, semantically robust visual tokenization.
  • It demonstrates state-of-the-art generation quality with improved gFID and inception scores while accelerating convergence compared to prior VFM-aligned methods.
  • Empirical evaluations on ImageNet-1K validate that increasing latent dimensionality consistently enhances reconstruction, generation, and representation performance.

RecTok: Reconstruction Distillation along Rectified Flow

Motivation and Problem Statement

Continuous tokenization via autoencoders (VAE, VQ-VAE) is the workhorse for latent generative models, notably diffusion models. State-of-the-art image generation and editing models are increasingly bottlenecked by the latent space properties induced by their visual tokenizers. The empirical trade-off between latent dimensionality and generative performance remains unresolved: while high-dimensional latents preserve more content and permit discriminative representations, they have empirically led to optimization and generalization issues, causing the community to restrict production models to low-dimensional spaces. Prior attempts to distill semantics from vision foundation models (VFMs) into the tokenizers have not resolved this bottleneck; generation quality and convergence degrade as dimensionality increases beyond ~32. The absence of a tokenizer that admits high-dimensional, semantically-rich, and generatively stable representations has hampered unified modeling of editing, generation, and understanding tasks.

Technical Contributions: Flow Semantic Distillation and Reconstruction-Alignment Distillation

RecTok introduces two novel mechanisms in visual tokenizer design, formalized within the rectified flow framework for image generation.

  1. Flow Semantic Distillation (FSD): Unlike prior approaches that align VFM semantics only on the noiseless code x_0, RecTok supervises the entire forward flow {x_t | t ∈ [0, 1]} between data and noise. This trajectory, governed by the rectified flow, is exactly the space the diffusion model is trained on downstream. By distilling VFM features along the whole path rather than solely at x_0, semantic structure is enforced on all intermediate latents, yielding substantially more discriminative and robust representations for the diffusion transformer. This design choice is empirically validated via linear probing accuracy and t-SNE visualizations (Figure 1).

Figure 1: Linear probing accuracy and t-SNE organization show RecTok's superiority across the forward flow.

  2. Reconstruction-Alignment Distillation (RAD): Inspired by masked image modeling, RecTok applies random spatial masking to the input and reconstructs VFM features for both masked and unmasked regions by decoding noisy latent features from the forward flow. This dual objective, pixel-space reconstruction plus VFM feature alignment, further enforces semantic content and robustness, especially as latent dimensionality increases. Empirical ablations confirm that the joint optimization outperforms either objective in isolation (Figure 2). A minimal sketch of how both distillation losses could be computed is given after this list.

    Figure 2: Overview of RecTok pipeline, illustrating random masking, forward flow, and dual decoders for training.
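
To make the two distillation objectives concrete, below is a minimal PyTorch-style sketch under our own assumptions: the module names (projector, feature_decoder), the cosine-distance loss, the token-level masking, and the per-sample timestep sampling are illustrative choices, not the authors' implementation details.

```python
import torch
import torch.nn.functional as F


def rectified_flow_interpolate(x0, noise, t):
    # Forward flow of rectified flow: straight-line interpolation
    # between the clean latent x0 (t = 0) and Gaussian noise (t = 1).
    return (1.0 - t) * x0 + t * noise


def cosine_distance(pred, target):
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()


def rectok_distillation_losses(x0, vfm_feat, projector, feature_decoder, mask_ratio=0.5):
    """Sketch of Flow Semantic Distillation (FSD) and Reconstruction-Alignment
    Distillation (RAD) for one training step.

    x0:              clean latent tokens, shape (B, N, D)
    vfm_feat:        frozen VFM features for the same image, shape (B, N, C)
    projector:       maps noisy latents x_t (conditioned on t) to the VFM feature space
    feature_decoder: reconstructs VFM features for all tokens from masked, noisy latents
    """
    B, N, _ = x0.shape
    noise = torch.randn_like(x0)
    t = torch.rand(B, 1, 1, device=x0.device)  # one timestep per sample
    x_t = rectified_flow_interpolate(x0, noise, t)

    # FSD: align intermediate flow states (not only x0) with VFM features.
    fsd_loss = cosine_distance(projector(x_t, t), vfm_feat)

    # RAD: drop a random subset of tokens, then reconstruct VFM features for
    # both masked and visible positions from the noisy latents. (The paper
    # masks the input image spatially; token-level masking is a simplification.)
    keep = (torch.rand(B, N, 1, device=x0.device) > mask_ratio).to(x_t.dtype)
    rad_loss = cosine_distance(feature_decoder(x_t * keep, t), vfm_feat)

    return fsd_loss, rad_loss
```

In the full pipeline these two terms would be added to the usual tokenizer reconstruction and regularization objectives; the exact architectures and weights should be taken from the official code.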

Empirical Validation and Ablations

RecTok is instantiated as a ViT-based autoencoder (ViT-B backbone with additional normalization and activation modules), evaluated primarily on ImageNet-1K at 256x256 resolution. The empirical results demonstrate:

  • State-of-the-art gFID (1.34 without guidance, 1.13 with classifier-free guidance), outperforming both pixel-space and prior latent-space methods. Notably, convergence is 7.75× faster than the strongest VFM-aligned baselines.
  • Inception Score (IS) scaling: Unlike previous approaches, IS continues to improve with training epochs and model size, indicating improved diversity and realism.
  • Dimensionality scalability: Unlike prior work (VA-VAE, RAE, MaskDiT, GigaTok), RecTok enjoys monotonic improvements in all relevant metrics—generation (gFID, IS), reconstruction (rFID, PSNR), and linear probing accuracy—as latent dimensionality is increased to 128 and beyond.
  • Semantic richness: RecTok's PCA and cosine-similarity heatmaps reveal structure with clear spatial correspondence to the object-level semantics present in VFM features, in contrast to prior methods (Figures 3 and 7).

    Figure 3: PCA and similarity heatmap overlays indicate RecTok's latent structure superiority.

Figure 4: Foreground object structure is highly localized in RecTok features; VA-VAE lacks such explicit semantic grouping.

Figure 5: FID converges rapidly (left) and improves consistently with parameter count and training epochs.

Qualitative Analysis

Class-conditional samples generated from RecTok-trained DiT models exhibit high-fidelity texture, color, and structure, without the artifacts typical of high-compression or low-dimensional regimes. Reconstructions faithfully capture fine object details, as illustrated by the inpainting and supplementary figures. RecTok's outputs generalize across classes, indicating robust learned priors rather than mode collapse or overfitting (Figure 6).

Figure 6: High-quality ImageNet-1K generations, faithfully capturing class diversity and fine structure.

Figure 7: Input (left) and reconstructed (right) images demonstrate content fidelity and low distortion.

Theoretical and Practical Implications

The core insight is that semantic consistency across the full forward flow trajectory, rather than limited supervision at the origin of the flow, is essential for scalable tokenizers. This insight is formalized and validated by RecTok. The empirical claims—that increasing latent dimensionality no longer degrades generation or semantics—contradict the prevailing consensus in both the pixel-diffusion and latent-diffusion literature.
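
In notation of our own choosing (not necessarily the paper's), the shift can be summarized as moving the distillation target from the flow's origin to an expectation over the whole trajectory:

```latex
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\qquad
\mathcal{L}_{\mathrm{FSD}} = \mathbb{E}_{t \sim \mathcal{U}[0,1]}
\Big[\, d\big(g_\phi(x_t, t),\, f_{\mathrm{VFM}}(I)\big) \Big],
```

whereas prior VFM-aligned tokenizers apply the distance d only at t = 0. Here g_phi is a learned projection, f_VFM the frozen foundation-model encoder, and d a feature-similarity distance; these symbols are our assumptions for exposition.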

Practically, RecTok enables unified visual backbones for generation, editing, and understanding. The semantically consistent high-dimensional representation aligns the internal structure required for editing (where reconstruction fidelity is paramount) and class-conditional or open-ended generation (where expressiveness and diversity are necessary). This bridges the prior gap where low-dimensional latents were required for generation, but these same spaces were inadequate for representation-intensive tasks.

Theoretically, the study supports a deeper investigation into optimizing flow trajectories in generative models, particularly the induction of VFM-style semantics throughout the generative process. It also motivates further analysis of the interplay between KL regularization, reconstruction loss, and feature distillation at scale.
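
As a purely illustrative sketch of that interplay (the weights below are placeholders, not values reported in the paper), the tokenizer's total objective can be read as a weighted sum of these signals:

```python
# Hypothetical composition of the tokenizer objective; the weights are
# illustrative placeholders, not the paper's hyperparameters.
def total_tokenizer_loss(rec_loss, kl_loss, fsd_loss, rad_loss,
                         w_kl=1e-6, w_fsd=1.0, w_rad=1.0):
    # Pixel reconstruction anchors fidelity, the KL term regularizes the
    # latent distribution, and the two distillation terms inject VFM
    # semantics along the forward flow.
    return rec_loss + w_kl * kl_loss + w_fsd * fsd_loss + w_rad * rad_loss
```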

Future Directions

  • Scaling latent dimensionality: With monotonic improvement observed up to 128 dimensions, it is plausible that even higher latent dimensionalities (subject to architecture and compute constraints) could enable direct application to high-resolution or video domains.
  • Adaptive regularization: The results indicate that the balance between KL penalty and feature alignment is critical. Future work could explore adaptive or learned weighting for these signals, possibly leveraging task-driven or downstream feedback.
  • Applications to unified multi-modal models: RecTok’s shared representation space is compatible with recent unified architectures (multimodal LLMs), enabling tokenization that equally supports synthesis and understanding.
  • Further bridging with VFMs: Although RecTok narrows the gap, discriminative performance still lags behind frozen VFM backbones. Joint optimization, ensembling, or more sophisticated flow designs may close this gap.

Conclusion

RecTok establishes that high-dimensional, semantically-rich, and generatively robust visual tokenization is feasible and highly effective when semantic consistency is enforced along the rectified flow trajectory. This paradigm breaks the entrenched reconstruction-generation tradeoff, yielding state-of-the-art performance in both domains without sacrificing discriminative power. The methodology enables the practical unification of visual representation for editing, generation, and understanding in a single backbone and points toward further scaling and integration in future AI systems (2512.13421).
