- The paper introduces DiTo, a self-supervised framework that uses a single diffusion L2 loss to train scalable image tokenizers.
- Experimental results on ImageNet show that DiTo outperforms GAN-LPIPS tokenizers in reconstructing complex visual details.
- The method’s scalability and simplified training pipeline enable efficient joint learning of latent representations for generative models.
An Expert Review of "Diffusion Autoencoders are Scalable Image Tokenizers"
The paper "Diffusion Autoencoders are Scalable Image Tokenizers" introduces a self-supervised framework called the Diffusion Tokenizer (DiTo). This novel method aims to simplify and improve the process of learning compact visual representations for image generation tasks. At its core, DiTo utilizes a diffusion loss-based objective to train scalable image tokenizers, highlighting its potential as a simpler alternative to the currently dominant GAN-LPIPS tokenizers.
Theoretical Framework and Motivation
The foundational motivation behind the paper is to address the complexity of state-of-the-art image tokenizers. Existing methods rely heavily on a combination of losses and heuristics, often involving pretrained supervised models, which complicates the training pipeline. By contrast, DiTo is trained with a single diffusion L2 loss, grounded in the diffusion modeling framework that has become dominant in image generation.
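To make the single-objective claim concrete, the sketch below shows what one training step with only a diffusion L2 loss might look like. This is a minimal, hypothetical PyTorch illustration rather than the paper's implementation: the tiny convolutional encoder and decoder, the latent width, and the channel-concatenation conditioning are all simplifying assumptions (DiTo's decoder is a far larger diffusion model), but the shape of the objective is the point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical modules; names, sizes, and the conditioning scheme are illustrative,
# not taken from the paper's code. The encoder maps an image to a compact latent
# (the "tokens"); the decoder is a small stand-in for a diffusion model that
# predicts a flow-matching velocity conditioned on that latent.
encoder = nn.Sequential(nn.Conv2d(3, 64, 4, stride=4), nn.SiLU(), nn.Conv2d(64, 8, 1))
decoder = nn.Sequential(nn.Conv2d(3 + 8 + 1, 64, 3, padding=1), nn.SiLU(), nn.Conv2d(64, 3, 3, padding=1))
opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def train_step(x):
    """One optimization step driven by a single diffusion (flow-matching) L2 loss."""
    z = encoder(x)                                      # compact latent code
    z_up = F.interpolate(z, size=x.shape[-2:])          # broadcast latent onto the pixel grid
    eps = torch.randn_like(x)                           # noise endpoint of the interpolation path
    t = torch.rand(x.size(0), 1, 1, 1)                  # random time in [0, 1]
    x_t = (1 - t) * x + t * eps                         # linear path between data and noise
    t_map = t.expand(-1, 1, *x.shape[-2:])              # crude time conditioning as an extra channel
    v_pred = decoder(torch.cat([x_t, z_up, t_map], 1))  # decoder sees noisy image, latent, and time
    loss = (v_pred - (eps - x)).pow(2).mean()           # the single L2 objective: match the velocity eps - x
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with random tensors standing in for a batch of 64x64 RGB images.
print(train_step(torch.randn(4, 3, 64, 64)))
```

Everything, encoder and decoder alike, receives gradients from that one scalar loss; there is no adversarial discriminator, perceptual network, or loss-weighting schedule to tune.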
Diffusion models, known for their robust probabilistic modeling capabilities, form the backbone of DiTo. These models learn the data distribution by estimating score functions of progressively noised data, a stochastic process the paper reframes within the context of autoencoder training. DiTo employs Flow Matching, a diffusion-style objective whose L2 loss can be connected to maximizing an evidence lower bound (ELBO) on the data likelihood. This choice is pivotal, as it grounds the autoencoder's training in a theoretically sound probabilistic framework.
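For reference, the flow-matching objective is commonly written as below, here conditioned on the encoder output so that encoder and decoder are trained by the same L2 term. This is the standard rectified-flow-style formulation and an assumption on my part; the paper's exact time distribution, weighting, and parameterization may differ.

```latex
% Standard conditional flow-matching loss (assumed textbook form, not
% necessarily the paper's exact parameterization). E_phi is the encoder,
% v_theta the diffusion decoder's velocity prediction.
\begin{aligned}
  x_t &= (1 - t)\,x + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),\ \ t \sim \mathcal{U}[0,1], \\[2pt]
  \mathcal{L}(\phi, \theta) &= \mathbb{E}_{x,\,\epsilon,\,t}
     \left\lVert\, v_\theta\!\left(x_t,\, t,\, E_\phi(x)\right) - (\epsilon - x) \,\right\rVert_2^2 .
\end{aligned}
```

Because the same expectation trains both the encoder E_phi and the velocity network v_theta, the latent space is shaped directly by what the probabilistic decoder needs in order to reconstruct the image.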
Experimental Validation
Experiments on ImageNet demonstrate that DiTo achieves superior image reconstruction quality compared to GAN-LPIPS tokenizers, particularly on complex visual structures such as text and symbols. Moreover, image generation models trained on DiTo's latent representations exhibit competitive or improved performance relative to those using traditional tokenizers.
The scalability of DiTo is tested across a range of model sizes, with reconstruction fidelity and visual quality consistently improving as the tokenizer is scaled up. This scaling behavior is less pronounced for GAN-LPIPS tokenizers, whose gains plateau as model capacity increases. The paper further probes DiTo's advantages through ablations, showing that its design is well suited to jointly learning latent representations alongside a powerful probabilistic decoder.
Practical and Theoretical Implications
Practically, DiTo simplifies training by removing the need to balance multiple losses or rely on pretrained discriminative models, making image tokenizer training more efficient and scalable. Theoretically, the ELBO-maximizing objective encourages the latent representations to retain the information needed to reconstruct the data, improving both the quantitative and qualitative performance of downstream generative models.
Speculation on Future Directions
The paper hints at future research directions, including extending DiTo beyond images to video, audio, and other modalities. Such extensions could offer a unified approach to scalable, self-supervised representation learning across multimedia contexts. Additionally, exploring content-aware tokenizers could optimize how densely information is encoded per input, further enhancing model efficiency and performance.
Conclusion
In conclusion, "Diffusion Autoencoders are Scalable Image Tokenizers" contributes significantly to the field of machine learning by presenting a streamlined, theoretically robust method for image tokenization and generation. Through the lens of diffusion models and self-supervision, it offers a promising alternative to the intricacies of current state-of-the-art tokenizers, paving the way for more elegant and effective models in AI-driven image processing tasks.