- The paper ImageFolder introduces an innovative semantic image tokenizer that creates spatially aligned, foldable tokens to balance reconstruction quality and token length for efficient autoregressive image generation.
- ImageFolder employs a dual-branch product quantization strategy to capture both semantic regularization and pixel-level detail, combined with parallel prediction and token folding to halve the sequence length during AR modeling.
- Evaluations on ImageNet demonstrate ImageFolder's superior performance in reconstruction and generation FID metrics compared to benchmarks, along with high efficiency due to reduced token sequence length.
The paper "ImageFolder: Autoregressive Image Generation with Folded Tokens" introduces ImageFolder, an innovative semantic image tokenizer designed to enhance autoregressive image generation. The central challenge addressed is the inherent trade-off between reconstruction quality and token length in image tokenizers, particularly within autoregressive (AR) models.
Key Contributions and Methodology:
- Image Tokenization Strategy:
- ImageFolder introduces a novel approach by creating spatially aligned image tokens that can be folded into a shorter sequence, improving generation efficiency without sacrificing the reconstruction quality that longer token sequences typically provide.
- Use of Product Quantization (PQ):
- The authors employ dual-branch product quantization to capture distinct aspects of images. One branch focuses on semantic regularization to encapsulate semantic information, while the other branch is dedicated to retaining pixel-level detail.
- The quantized tokens from each branch are combined, ensuring that both semantic and detail features are preserved.
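The dual-branch scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the latent is split channel-wise into two halves, each quantized by nearest-neighbor lookup against its own codebook (one "semantic", one "detail"), and the names `quantize` and `dual_branch_pq` are hypothetical.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbor lookup: map each row of z to its closest codebook entry."""
    # z: (n, d), codebook: (k, d) -> squared distances of shape (n, k)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def dual_branch_pq(z, sem_codebook, det_codebook):
    """Product-quantization-style dual branch: split the latent channel-wise,
    quantize each half with its own codebook, then concatenate the results."""
    d = z.shape[1] // 2
    q_sem, i_sem = quantize(z[:, :d], sem_codebook)
    q_det, i_det = quantize(z[:, d:], det_codebook)
    return np.concatenate([q_sem, q_det], axis=1), i_sem, i_det
```

Each spatial position is thus described by a pair of indices, one per branch, while the concatenated quantized feature keeps the original channel dimension.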
- Parallel Prediction and Token Folding:
- A notable feature of ImageFolder is the ability to predict the two tokens at each spatial position in parallel from a single hidden state. This significantly shortens the sequence length during AR modeling, improving computational efficiency without degrading generation quality.
- Token folding during AR modeling effectively halves the sequence length, facilitating faster processing.
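The folding and parallel-prediction idea above can be sketched as follows. This is an assumption-laden illustration, not the paper's code: `fold` and `predict_pair` are hypothetical names, and the two classification heads are plain linear projections applied to the same per-position hidden state.

```python
import numpy as np

def fold(sem_idx, det_idx):
    """Fold two spatially aligned token streams into one sequence of
    (semantic, detail) pairs: the AR model then runs over L positions
    instead of 2L tokens."""
    return np.stack([sem_idx, det_idx], axis=-1)  # (L, 2)

def predict_pair(hidden, W_sem, W_det):
    """Parallel prediction: one hidden state per position feeds two
    classification heads, yielding logits for both tokens of the pair."""
    return hidden @ W_sem, hidden @ W_det
```

Because both tokens of a pair are emitted from one position, the transformer processes half as many positions per image, which is where the efficiency gain comes from.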
- Representation and Semantic Regularization:
- Semantic regularization is applied within the quantization process to maintain compact semantic information. By segregating semantics from pixel details, the approach minimizes dependency and potential redundancy between tokens.
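One common way to realize such a semantic regularizer, sketched here as an assumption rather than the paper's exact loss, is a cosine-distance term pulling the semantic branch toward features from a frozen pretrained encoder; `semantic_reg_loss` and `teacher_feat` are hypothetical names.

```python
import numpy as np

def semantic_reg_loss(z_sem, teacher_feat, eps=1e-8):
    """Cosine-distance regularizer: encourage semantic-branch features to
    align with features from a frozen pretrained 'teacher' encoder."""
    a = z_sem / (np.linalg.norm(z_sem, axis=-1, keepdims=True) + eps)
    b = teacher_feat / (np.linalg.norm(teacher_feat, axis=-1, keepdims=True) + eps)
    return float(1.0 - (a * b).sum(axis=-1).mean())
```

The loss is near 0 when the two feature sets point in the same direction and approaches 2 when they are opposed, so minimizing it concentrates semantic content in the semantic branch and leaves pixel detail to the other.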
- Performance and Evaluation:
- Extensive experiments on the ImageNet dataset demonstrate the superiority of ImageFolder over traditional methods. Key metrics for evaluation include Fréchet Inception Distance (FID) for both reconstruction (rFID) and generation (gFID), where ImageFolder consistently outperformed benchmarks.
- ImageFolder also achieves high linear-probing accuracy, indicating that its tokens carry robust semantic representations.
- Efficiency and Scalability:
- ImageFolder's AR model benefits from a significantly reduced computational load: the shortened token sequences keep attention costs manageable even at LLM scale.
- The proposed method shows potential for scalability in multimodal tasks, highlighting its versatility beyond image generation.
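The efficiency claim above follows from self-attention's quadratic cost in sequence length: halving the token count cuts the attention-matrix work by roughly 4x. A back-of-the-envelope check (illustrative function name and numbers, not from the paper):

```python
def attn_matrix_flops(seq_len, dim):
    """Rough cost of forming the L x L attention matrix: each of L queries
    is dotted against L keys of dimension d."""
    return seq_len * seq_len * dim

# Folding halves the sequence length, quartering this quadratic term.
full = attn_matrix_flops(256, 64)
folded = attn_matrix_flops(128, 64)
```

Projection and feed-forward costs scale only linearly in sequence length, so the end-to-end speedup is somewhat less than 4x but still substantial.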
- Qualitative and Quantitative Evaluations:
- Generated images exhibit high fidelity and semantic accuracy. The combination of semantic-rich tokens and detail tokens enables effective generation even in zero-shot conditional scenarios.
The paper concludes by noting room for further optimization of ImageFolder under more advanced training regimes. Overall, ImageFolder presents a systematic and innovative approach to overcoming the limitations of traditional image tokenizers in AR models, balancing the length-quality trade-off and enabling efficient, high-quality image generation.