- The paper presents TokBench, a benchmark specifically evaluating visual tokenizers on preserving text and facial details using metrics like T-ACC, T-NED, and F-Sim.
- The methodology compares discrete tokenizers and continuous VAEs across compression settings and resolutions, revealing clear differences in how well each preserves fine-grained content.
- The findings indicate that continuous VAEs outperform discrete tokenizers, highlighting the need for improved architectures to enhance detailed visual reconstruction.
TokBench: Evaluating Your Visual Tokenizer Before Visual Generation
The paper introduces TokBench, a benchmark designed to evaluate the reconstruction performance of visual tokenizers, with a particular focus on the preservation of fine-grained features in two challenging visual elements: text and human faces.
Introduction to Visual Tokenization
The advancement of visual generation models, led by autoregressive and diffusion models, has significantly benefited from visual tokenizers. Autoregressive models rely on discrete tokenizers that compress images into token sequences for next-token prediction, while diffusion models operate in the continuous latent space of a VAE. However, current models often struggle to preserve fine-grained visual details, especially for human-sensitive content such as text and facial features. TokBench aims to evaluate these limitations directly.
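To make the discrete-tokenization pipeline concrete, here is a minimal shape-level sketch of how an image becomes a token sequence. The downsampling factor, codebook size, and latent dimension are illustrative assumptions, and random tensors stand in for a learned encoder; this is not the configuration of any specific tokenizer.

```python
import torch

B, C, H, W = 1, 3, 256, 256                       # a single RGB input image
downsample, codebook_size, dim = 16, 8192, 256    # assumed, for illustration only

# A real tokenizer uses a learned encoder; a random feature grid stands in
# here purely to show the shapes involved.
h, w = H // downsample, W // downsample
latents = torch.randn(B, h, w, dim)          # continuous feature grid
codebook = torch.randn(codebook_size, dim)   # learned codebook embeddings

# Vector quantization: map each latent vector to its nearest codebook index.
dists = torch.cdist(latents.reshape(-1, dim), codebook)   # (B*h*w, K)
token_ids = dists.argmin(dim=-1).reshape(B, h, w)         # discrete token grid

# Flatten the grid into a 1D sequence for autoregressive next-token prediction.
token_sequence = token_ids.flatten(1)                     # shape (1, 256)
print(token_sequence.shape)
```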
TokBench Dataset and Metrics
TokBench comprises a curated collection of images and video clips rich in text and facial content. The dataset features a balanced instance-scale distribution, providing a solid basis for evaluating the reconstruction capabilities of visual tokenizers (Figure 1).
Figure 1: Statistics and Sample Diversity of TokBench-Image. TokBench features a balanced instance-scale distribution with particular emphasis on small-scale face and text instances, presenting significant challenges for existing visual reconstruction approaches.
The benchmark employs specialized metrics to assess reconstruction performance: T-ACC (Text Recognition Accuracy), T-NED (Text Normalized Edit Distance), and F-Sim (Facial Similarity). These metrics focus on readability and identity preservation, diverging from traditional pixel-level and feature-based metrics, which often fail to align with human perceptual assessments (Figure 2).
Figure 2: Comparison of Different Metrics with Human Judgments. In each case, previous metrics (PSNR, SSIM, LPIPS) demonstrate discrepancies with human assessments, whereas our proposed face similarity and text accuracy effectively reflect the reconstruction quality.
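As a hedged sketch of how text- and face-centric metrics of this kind can be computed, the snippet below implements a normalized edit distance for recognized text and a cosine similarity for face embeddings. The exact normalization and aggregation conventions used by TokBench are assumptions here, and the OCR and face-recognition models that would produce the recognized strings and embeddings are omitted.

```python
import numpy as np

def normalized_edit_distance(pred: str, gt: str) -> float:
    """Levenshtein distance between two strings, divided by the longer length."""
    m, n = len(pred), len(gt)
    if max(m, n) == 0:
        return 0.0
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)
    dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[i, j] = min(dp[i - 1, j] + 1,         # deletion
                           dp[i, j - 1] + 1,         # insertion
                           dp[i - 1, j - 1] + cost)  # substitution
    return dp[m, n] / max(m, n)

def text_metrics(recon_texts, gt_texts):
    """T-ACC as exact-match rate, T-NED as mean normalized edit distance
    (one plausible convention; the paper may report the complement)."""
    t_acc = float(np.mean([p == g for p, g in zip(recon_texts, gt_texts)]))
    t_ned = float(np.mean([normalized_edit_distance(p, g)
                           for p, g in zip(recon_texts, gt_texts)]))
    return t_acc, t_ned

def face_similarity(recon_embs, gt_embs):
    """F-Sim as mean cosine similarity between face embeddings of the
    reconstructed and original crops."""
    recon = recon_embs / np.linalg.norm(recon_embs, axis=1, keepdims=True)
    gt = gt_embs / np.linalg.norm(gt_embs, axis=1, keepdims=True)
    return float(np.mean(np.sum(recon * gt, axis=1)))
```

In practice, `recon_texts` would come from running an off-the-shelf OCR model on the reconstructed crops and `gt_texts` on the originals, so the scores measure legibility and identity after reconstruction rather than pixel agreement.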
Evaluation Framework
TokBench is lightweight, requiring minimal computational resources for evaluation. The benchmark can complete assessments within 4 minutes using just 2GB of memory, offering an efficient and accessible platform for testing visual tokenizers.
Image and Video Analysis
TokBench evaluates a range of visual tokenization approaches, including discrete VQVAEs and continuous-space VAEs. Continuous VAEs such as SD3.5 and FLUX significantly outperform discrete models, particularly on text and face reconstruction (Figure 3).
Figure 3: T-ACC and F-Sim metrics across reconstruction resolutions and target scales. Smaller scales present greater challenges, and even the best-performing VAEs show a gap compared to the "resize" upper bound.
Current discrete tokenizers fall short due to information loss during vector quantization, especially at smaller scales. In contrast, continuous models preserve fine detail more faithfully, but still leave room for improvement relative to the theoretical upper bound.
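The scale sensitivity can be made concrete with a back-of-the-envelope calculation: once a text line or face shrinks to roughly the stride of the latent grid, all of its detail must be carried by one or two quantized vectors. The downsampling factors below are common choices (8x for many continuous VAEs, 16x or more for several discrete tokenizers) and are assumptions for this sketch rather than figures from the paper.

```python
def latent_cells_per_side(instance_px: int, downsample: int) -> int:
    """Latent-grid cells (per side) spanned by a square instance of this size."""
    return max(1, instance_px // downsample)

for size_px in (16, 32, 64, 128):      # instance height in pixels
    for factor in (8, 16, 32):         # assumed spatial downsampling factor
        cells = latent_cells_per_side(size_px, factor)
        print(f"{size_px:>3}px instance at {factor:>2}x downsampling "
              f"-> {cells}x{cells} latent cells")
```

A 16px text line at 16x downsampling collapses to a single latent cell, which a discrete tokenizer must then snap to one codebook entry, so small instances are where quantization loss bites hardest.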
Ablation Studies and Additional Insights
Additional experiments showed that adding text-rich data to training improves performance, although the gains were moderate, suggesting that architectural design matters more than training-data composition.
A detailed examination of video tokenization further confirmed that excessive compression in the spatial and temporal dimensions compromises reconstruction quality. Nonetheless, models such as Hunyuan-Video and CogVideoX perform well even under strong compression settings.
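For a rough sense of what those compression settings mean, the arithmetic below uses a 4x temporal and 8x spatial downsampling with 16 latent channels, which are commonly reported configurations for recent video VAEs but are treated as assumptions here rather than figures from the paper.

```python
def compression_ratio(frames, height, width, t_factor=4, s_factor=8,
                      latent_channels=16, rgb_channels=3):
    """Ratio of raw RGB video elements to latent elements."""
    raw = frames * height * width * rgb_channels
    latent = ((frames // t_factor) * (height // s_factor)
              * (width // s_factor) * latent_channels)
    return raw / latent

print(compression_ratio(frames=32, height=480, width=832))   # -> 48.0
```

Each latent element thus has to summarize dozens of raw pixel values, which is why pushing either factor higher tends to erode small text and facial detail first.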
Conclusion
TokBench provides a benchmark tailored explicitly to evaluating visual tokenizers' reconstruction quality on text and facial features. It complements traditional evaluation metrics with an assessment that aligns more closely with human perception. The results underscore the need for architectures that better preserve fine-grained visual details, pointing toward research directions in both model architecture and training methodology. The paper serves as a valuable resource for advancing visual tokenization and reconstruction fidelity.