Overview of FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
The paper "FlexTok: Resampling Images into 1D Token Sequences of Flexible Length" introduces a novel approach to image tokenization aimed at advancing autoregressive (AR) image generation efficiency and quality. Traditional image generation often employs fixed-size 2D grid tokenization, leading to inefficiencies when the complexity of image content varies. FlexTok overcomes this limitation by allowing images to be represented as variable-length 1D token sequences, adjusting to the actual complexity of the images being processed.
Key Contributions and Findings
The primary contributions of this research include:
- 1D Tokenization with Flexible Length: FlexTok resamples 2D images into ordered 1D token sequences of variable length, from 1 up to 256 tokens for 256x256 images. Unlike rigid 2D grid approaches, where every image is forced into the same token count, this flexibility lets simple images be described compactly while complex images keep enough tokens to preserve quality (a minimal prefix-truncation sketch follows this list).
- Autoregressive Training and Generation: Trained with a simple GPT-style Transformer in an AR setting, FlexTok shows strong class-conditional generation on datasets like ImageNet. Because the token sequence is ordered from coarse to fine, a coherent image for a condition such as "golden retriever" can be generated from only a handful of tokens, in contrast to fixed-grid methods like LlamaGen, which must generate the full token grid regardless of image complexity (an illustrative sampling loop is shown after this list).
- Decoding with Rectified Flow: FlexTok uses a rectified flow model as its decoder, producing high-quality reconstructions from token sequences of any length. Training with nested dropout, which randomly truncates the token sequence to a prefix, makes the decoder robust to varying compression levels: reconstructions remain plausible even from very short sequences and gain detail as more tokens are supplied (a decoding sketch is included after this list).
- Visual Vocabulary Emergence: The paper identifies an emergent, ordered visual vocabulary in the 1D token sequences: early tokens capture high-level semantic concepts, and later tokens progressively add finer detail. This structured coarse-to-fine progression contrasts with conventional non-hierarchical token grids and amounts to a form of semantic compression.
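
To make the flexible-length behavior concrete, below is a minimal sketch (not the authors' code) of the nested-dropout-style prefix truncation that lets a single decoder handle 1D token sequences of any length; the names and shapes (`registers`, a batch of up to 256 tokens) are illustrative assumptions.

```python
import torch

def nested_dropout(registers: torch.Tensor, training: bool = True) -> torch.Tensor:
    """Keep only a random-length prefix of the ordered 1D tokens.

    registers: (batch, max_tokens, dim). During training, each sample retains
    its first k tokens (k drawn uniformly from [1, max_tokens]) and the rest
    are zeroed, so the downstream decoder learns to reconstruct images from
    every prefix length.
    """
    if not training:
        return registers
    b, n, _ = registers.shape
    keep = torch.randint(1, n + 1, (b,))                     # one prefix length per sample
    mask = torch.arange(n).unsqueeze(0) < keep.unsqueeze(1)  # (b, n) boolean prefix mask
    return registers * mask.unsqueeze(-1)

# Example: a batch of 4 sequences, each with up to 256 tokens of width 16.
regs = torch.randn(4, 256, 16)
truncated = nested_dropout(regs)
```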
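
The AR side can be illustrated with a plain next-token sampling loop that simply stops after a chosen token budget; `prior` stands in for any GPT-style module returning per-position logits and is an assumption, not the paper's released interface.

```python
import torch

@torch.no_grad()
def sample_token_prefix(prior, class_id: int, num_tokens: int) -> torch.Tensor:
    """Sample `num_tokens` discrete image tokens conditioned on a class id.

    `prior` is assumed to be a GPT-style module mapping an id sequence of
    shape (1, T) to next-token logits of shape (1, T, vocab_size).
    """
    seq = torch.tensor([[class_id]])                  # class id acts as the conditioning prefix
    for _ in range(num_tokens):
        logits = prior(seq)[:, -1, :]                 # logits for the next token only
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, 1:]                                 # drop the class prefix, keep image tokens
```

Here `num_tokens` is the only knob: a small budget yields a coarse but semantically coherent image, while a larger budget adds detail.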
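
Finally, a hedged sketch of how a rectified-flow decoder could turn a token prefix into an image latent via simple Euler integration; `velocity_net`, the latent shape, and the step count are assumptions for illustration, not the paper's exact decoder.

```python
import torch

@torch.no_grad()
def rectified_flow_decode(velocity_net, tokens: torch.Tensor,
                          latent_shape=(1, 4, 32, 32), steps: int = 25) -> torch.Tensor:
    """Integrate dx/dt = v(x, t, tokens) from noise at t=0 to an image latent
    at t=1 with plain Euler steps. The token prefix enters only as conditioning,
    so the same decoder works for sequences of any length."""
    x = torch.randn(latent_shape)                    # start from pure noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt)   # current time for each batch element
        x = x + dt * velocity_net(x, t, tokens)      # Euler step along the learned flow
    return x
```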
Implications and Future Perspectives
From a practical standpoint, FlexTok could reduce the computational cost of AR modeling by matching the token count per image to its content complexity, improving both storage and inference efficiency in large-scale settings. Conceptually, it extends the understanding of tokenization beyond fixed-grid paradigms toward adaptive, content-dependent representations.
Looking ahead, the same idea could extend to domains such as video and audio, where redundancy varies greatly and flexible tokenization could yield similar efficiency and quality gains. Future work could also explore pairing FlexTok with lighter-weight decoders, or with larger models evaluated in zero-shot settings to test how well the approach generalizes across tasks.
Overall, while the improvement is incremental rather than radical, FlexTok makes meaningful progress on the adaptability and efficiency of generative image modeling. It encourages further work on reducing computational cost and broadens the applicability of AR generation, which could be especially valuable for deployments in resource-constrained environments.