Highly Compressed Tokenizer Can Generate Without Training

Published 9 Jun 2025 in cs.CV and cs.AI | (2506.08257v1)

Abstract: Commonly used image tokenizers produce a 2D grid of spatially arranged tokens. In contrast, so-called 1D image tokenizers represent images as highly compressed one-dimensional sequences of as few as 32 discrete tokens. We find that the high degree of compression achieved by a 1D tokenizer with vector quantization enables image editing and generative capabilities through heuristic manipulation of tokens, demonstrating that even very crude manipulations -- such as copying and replacing tokens between latent representations of images -- enable fine-grained image editing by transferring appearance and semantic attributes. Motivated by the expressivity of the 1D tokenizer's latent space, we construct an image generation pipeline leveraging gradient-based test-time optimization of tokens with plug-and-play loss functions such as reconstruction or CLIP similarity. Our approach is demonstrated for inpainting and text-guided image editing use cases, and can generate diverse and realistic samples without requiring training of any generative model.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper demonstrates that directly manipulating a highly compressed 1D latent tokenizer enables image editing and generation without auxiliary generative models.
The method employs copy-paste token strategies and gradient-based optimization (e.g., CLIP similarity) to achieve precise text-guided edits and inpainting.
Results indicate that higher compression ratios improve generation quality while reducing computational costs and offering flexible control over image attributes.

Training-Free Image Generation via Token Manipulation

The paper "Highly Compressed Tokenizer Can Generate Without Training" (2506.08257) introduces a novel approach to image editing and generation that leverages the highly compressed latent space of 1D image tokenizers. By manipulating the discrete tokens produced by these tokenizers, the paper demonstrates that it is possible to achieve fine-grained image editing and generate diverse, realistic samples without training any additional generative models. This work challenges the conventional paradigm of image generation, which typically involves training complex generative models on top of image tokenizers.

Background and Motivation

Standard image generation pipelines consist of a tokenizer, which embeds an image into a compact latent space, and a generative model that operates on this latent space. Recently introduced 1D tokenizers, such as TiTok (Gubnitsky et al., 2023), have shown that highly compressed latent spaces can significantly improve generation performance. These tokenizers represent images as one-dimensional sequences of discrete tokens, achieving high compression ratios. This paper explores the extent to which these tokenizers can be viewed as generative models themselves, given their ability to compress images into a small number of tokens. The authors hypothesize that the decoders of highly compressed tokenizers possess inherent generative capabilities, and that these capabilities can be harnessed through direct manipulation of the tokens in the latent space.

Methodology and Experiments

The paper presents a series of experiments to demonstrate the generative capabilities of 1D tokenizers. First, the authors show that simple latent space manipulations, such as copying and replacing tokens between latent representations of images, can result in image editing capabilities typically associated with generative models.

Figure 1: Selected generation results (rows 2, 4, 6) alongside their seeds (rows 1, 3, 5). Corresponding classes shown in table.

Building on this insight, they find that the latent space of 1D tokens is amenable to straightforward gradient-based optimization of various objectives. By optimizing a reconstruction loss, a pretrained tokenizer can perform inpainting tasks. Similarly, by optimizing CLIP similarity (Radford et al., 2021), the tokenizer can perform text-guided image editing (Figure 2).

(Figure 2)

Figure 2: Without further training, a pretrained tokenizer can perform generative tasks such as (a) text-guided editing and (b) inpainting. Because the highly compressed latent space of 1D tokenizers enables generative capabilities via direct token manipulation, we generate using test-time optimization of tokens with (a) CLIP similarity or (b) reconstruction objectives, without training any dedicated generative model.

The authors combine text-guided token optimization with retrieval from a small set of seed images to generate realistic and diverse samples without training any dedicated generative model.

Key Findings

The paper's experiments reveal several key findings:

Token position is key to token semantics: Certain token positions in the 1D latent space correspond to specific interpretable, high-level attributes of the image.
Token perturbations can achieve meaningful edits: Modifying tokens at relevant positions can alter high-level attributes of the image.
"Copy and paste" in latent space enables image editing: A simple copy-and-paste approach, where tokens corresponding to desired attributes are copied from a reference image to a target image, can achieve interpretable and high-quality image edits (Figure 3).

Figure 4: Flexible text-guided editing of an input image using test-time optimization with CLIP gradients. Key attributes of the input image such as pose are preserved while aligning the image with the prompt. Prompts are listed in the appendix.

Gradient-based latent optimization enables text-guided image editing: Maximizing CLIP similarity with respect to a user-supplied text prompt enables visually coherent and prompt-aligned image edits (Figure 5).

Figure 6: “From-scratch” generation starting from random tokens. Instead of starting with an initial image, it is also possible to apply our test-time latent token optimization starting from a randomly sampled initialization. Selected samples for prompts “a dog running on a dirt trail”, “majestic deer standing in a snowy forest”, “a duck in a pond,” and “a wide landscape vista overlooking fields and a town”.

Test-time optimization can be used for inpainting: Optimizing a reconstruction loss over the unmasked part of the image enables successful inpainting.
Generative performance improves with increasing compression: A decreasing number of tokens and a decreasing codebook size lead to significant improvements in generation quality.

Implications and Future Directions

The paper's findings have significant implications for the field of image generation. By demonstrating that highly compressed tokenizers possess inherent generative capabilities, the paper suggests that the traditional paradigm of training separate generative models on top of tokenizers may be unnecessary. The authors propose that scaling tokenizers to even higher compression ratios, larger datasets, or different domains could lead to further improvements in generative performance. The paper's approach has several advantages over traditional generative models:

It does not require training any additional generative models, reducing computational costs.
It allows for flexible control over image attributes through direct manipulation of tokens.
It can be applied to various image editing and generation tasks, such as inpainting and text-guided editing.

However, the approach also has some limitations:

The quality of the generated images depends on the quality of the pretrained tokenizer.
The approach may struggle with generating images of objects or scenes that are not well-represented in the tokenizer's training data.

Figure 7: Qualitatively, the gap in generation quality is very significant between the TiTok 32-token, 64-token, and VQGAN models. Prompts are “a photo of an eagle” and “a photo of a tiger” for the first and second rows, respectively.

Future research could explore ways to improve the generative capabilities of tokenizers, such as by training them with larger datasets or by incorporating additional constraints or objectives. It would be interesting to investigate whether the approach can be extended to other domains, such as video or audio generation.

Conclusion

The paper "Highly Compressed Tokenizer Can Generate Without Training" (2506.08257) presents a compelling case for the generative capabilities of highly compressed 1D image tokenizers. By demonstrating that these tokenizers can be used for image editing and generation without training any additional generative models, the paper challenges the conventional paradigm of image generation and opens up new avenues for future research. The paper's findings suggest that the key to unlocking the full potential of image generation may lie in developing tokenizers that can achieve even higher compression ratios and capture more semantic information in their latent spaces.

Markdown Report Issue