Training visual tokenizers at higher resolution and variable aspect ratios
Develop training procedures that let the Transformer-based visual tokenizer with Binary Spherical Quantization (BSQ-ViT) handle higher-resolution inputs and variable aspect ratios. Current experiments cover only 128×128 and 256×256 resolutions for images and 128×128 for videos.
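To make the scaling concern concrete, the following minimal sketch counts the tokens a patch-based tokenizer would produce at a given input size. The patch size of 8 and the helper names `token_grid` and `num_tokens` are illustrative assumptions, not details taken from the source.

```python
def token_grid(height: int, width: int, patch: int = 8) -> tuple[int, int]:
    """Return the (rows, cols) token grid for a non-overlapping patch embedding.

    `patch` is a hypothetical patch size chosen for illustration only.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def num_tokens(height: int, width: int, patch: int = 8) -> int:
    """Total number of tokens per frame for the given input size."""
    rows, cols = token_grid(height, width, patch)
    return rows * cols


if __name__ == "__main__":
    # Square resolutions used in the current experiments, plus one
    # non-square size to show how the grid shape follows the aspect ratio.
    for h, w in [(128, 128), (256, 256), (256, 384)]:
        print(f"{h}x{w}: grid={token_grid(h, w)}, tokens={num_tokens(h, w)}")
```

Under these assumptions, doubling the side length from 128 to 256 quadruples the token count (256 to 1024), and a non-square 256×384 input produces a 32×48 grid, so both the sequence length and the grid shape change with the input size.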
References
Training a visual tokenizer on higher-resolution inputs and variable aspect ratio remains unexplored.
— Image and Video Tokenization with Binary Spherical Quantization
(2406.07548 - Zhao et al., 2024) in Appendix, Section "Limitations"