Training visual tokenizers at higher resolution and variable aspect ratios

Develop training procedures for the Transformer-based visual tokenizer with Binary Spherical Quantization (BSQ-ViT) to handle higher-resolution inputs and variable aspect ratios, extending beyond the current experiments on 128×128 and 256×256 resolutions for images and 128×128 for videos.

Background

The paper evaluates the proposed BSQ-ViT tokenizer on relatively low-resolution images (128×128 and 256×256) and videos (128×128). While BSQ-ViT shows strong performance within these settings, the authors explicitly note that adapting training to higher-resolution inputs and variable aspect ratios has not been explored.

This open problem is important for practical deployment and broader applicability, as many real-world image and video datasets contain diverse resolutions and aspect ratios. Addressing it may require modifications to the model architecture (e.g., positional embeddings or the patching strategy), training objectives, data preprocessing, and optimization schedules to maintain training stability and codebook utilization at scale.
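One common starting point for the positional-embedding issue is to resize a learned 2D positional-embedding table when the patch grid changes, so a tokenizer trained at one resolution can be fine-tuned at another. The sketch below illustrates this with bicubic interpolation; it is a generic ViT adaptation technique, not a procedure described in the BSQ-ViT paper, and the function name, grid sizes, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor,
                     old_grid: tuple[int, int],
                     new_grid: tuple[int, int]) -> torch.Tensor:
    """Interpolate a (1, H*W, D) positional-embedding table from an
    old_grid=(H, W) patch layout to new_grid=(H', W').

    Hypothetical helper for adapting a ViT tokenizer to a new
    resolution or aspect ratio; not from the BSQ-ViT codebase.
    """
    _, n, d = pos_embed.shape
    h, w = old_grid
    assert n == h * w, "embedding count must match the old patch grid"
    # Reshape tokens back onto the 2D grid: (1, D, H, W)
    grid = pos_embed.reshape(1, h, w, d).permute(0, 3, 1, 2)
    # Bicubic resize to the new grid, e.g. for a larger or non-square input
    grid = F.interpolate(grid, size=new_grid, mode="bicubic",
                         align_corners=False)
    # Flatten back to a token sequence: (1, H'*W', D)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], d)

# Example: a table trained on a 16x16 patch grid (e.g. 256x256 input with
# patch size 16) adapted to a 32x24 grid (e.g. 512x384 input).
pe = torch.randn(1, 16 * 16, 64)
pe_hi = resize_pos_embed(pe, (16, 16), (32, 24))
```

Alternatives with different trade-offs include 2D sinusoidal or rotary embeddings (resolution-agnostic by construction) and sequence-packing schemes that train directly on native aspect ratios.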

References

Training a visual tokenizer on higher-resolution inputs and variable aspect ratios remains unexplored.

Image and Video Tokenization with Binary Spherical Quantization  (2406.07548 - Zhao et al., 2024) in Appendix, Section "Limitations"