
Adaptive Length Image Tokenization via Recurrent Allocation (2411.02393v1)

Published 4 Nov 2024 in cs.CV, cs.AI, cs.LG, and cs.RO

Abstract: Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence - and even LLMs - which allocate varying representational capacities based on entropy, context and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.

Adaptive Length Image Tokenization via Recurrent Allocation: An Expert Analysis

The paper "Adaptive Length Image Tokenization via Recurrent Allocation" presents an innovative approach to image representation that bridges the gap between current image tokenization methods and the sophisticated and adaptable processing observed in human cognition and LLMs. This work introduces the Adaptive Length Image Tokenizer (ALIT), a mechanism that leverages recurrent neural architecture to recursively process 2D image tokens, transforming them into a variable number of 1D latent tokens through iterative rollouts. This novel implementation allows for adaptive tokenization based on image complexity, entropy, and familiarity with the training data, thus optimizing the representation capacity per image.

Technical Approach

The paper critiques the fixed-length tokenization prevalent in modern vision systems such as VAEs, VQGANs, and ViTs, where the number of visual tokens is constant and hand-chosen, ignoring the diversity and complexity of image datasets: a ViT with 16x16 patches, for instance, always emits 196 tokens for a 224x224 image, whether it depicts a blank wall or a cluttered street scene. The proposed method instead adopts an adaptive-length strategy built on recurrent computation, inspired by models like the Perceiver. Its encoder-decoder framework iteratively refines the image tokens and adds new latent tokens at each iteration, so that the final representation size reflects the image's entropy and content familiarity.

Because the same network is applied recursively, representational capacity can be adjusted dynamically, and the resulting 1D latents are not constrained by the 2D inductive bias of traditional patch-to-token transformations. This realizes a form of latent token distillation: 2D image tokens are compressed into compact 1D representations that retain the most salient features.
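To make the recurrent allocation loop concrete, the sketch below shows one way the grow-and-refine recurrence could be structured. It is a minimal PyTorch-style illustration, not the authors' implementation: the module names, layer counts, and the 32-tokens-per-iteration schedule (matching the 32-to-256 range quoted in the abstract) are assumptions.

```python
# Minimal PyTorch sketch of recurrent latent-token allocation.
# All names, shapes, and hyperparameters are illustrative assumptions;
# only the overall loop structure follows the paper's description.
import torch
import torch.nn as nn

class RecurrentTokenizerSketch(nn.Module):
    def __init__(self, dim=256, tokens_per_iter=32, max_iters=8, heads=8):
        super().__init__()
        # Fresh learnable latent tokens appended at every rollout:
        # 32 per iteration, up to 8 * 32 = 256 in total.
        self.new_latents = nn.Parameter(0.02 * torch.randn(tokens_per_iter, dim))
        self.max_iters = max_iters
        # The same weights are reused at every iteration (recurrent rollout).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_tokens, n_iters):
        # image_tokens: (B, N, dim) patch embeddings of the 2D image.
        B, _, dim = image_tokens.shape
        latents = image_tokens.new_zeros(B, 0, dim)  # start with no 1D latents
        for _ in range(min(n_iters, self.max_iters)):
            # Grow representational capacity with a fresh set of latents.
            latents = torch.cat([latents, self.new_latents.expand(B, -1, -1)], dim=1)
            # Jointly attend over 2D tokens and 1D latents: the latents distill
            # image content while the 2D tokens are refined in place.
            n = image_tokens.size(1)
            fused = self.encoder(torch.cat([image_tokens, latents], dim=1))
            image_tokens, latents = fused[:, :n], fused[:, n:]
        return latents  # (B, tokens_per_iter * n_iters, dim)

tok = RecurrentTokenizerSketch()
patches = torch.randn(2, 196, 256)  # e.g. 14 x 14 ViT patch embeddings
z = tok(patches, n_iters=3)         # -> torch.Size([2, 96, 256])
```

In the actual method a decoder (omitted here) maps the 1D latents back to 2D tokens for the reconstruction loss at each rollout; the sketch captures only the grow-and-refine recurrence.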

Empirical Analysis

The ALIT system was evaluated primarily with reconstruction loss and Fréchet Inception Distance (FID), showing performance competitive with state-of-the-art tokenizers, including 2D VQGANs and fixed-length 1D tokenizers such as TiTok. A notable finding is that the number of tokens ALIT allocates per image is consistent with human-annotated complexity, supporting the hypothesis that more complex images require more tokens.
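This complexity-dependent behavior suggests a simple stopping rule, sketched below under assumed interfaces: allocate another round of latent tokens only while the reconstruction error remains above a target. The `decoder` callable and the MSE threshold are hypothetical stand-ins, not the paper's exact criterion.

```python
# Illustrative adaptive stopping rule (a plausible reading of the paper's
# evaluation, not its exact criterion).
import torch
import torch.nn.functional as F

@torch.no_grad()
def tokens_needed(tokenizer, decoder, patches, max_iters=8, tol=0.01):
    """Allocate more latent tokens only while reconstruction stays poor."""
    for n_iters in range(1, max_iters + 1):
        latents = tokenizer(patches, n_iters)  # (B, 32 * n_iters, dim)
        recon = decoder(latents)               # hypothetical decoder back to patch space
        if F.mse_loss(recon, patches) < tol:   # simple images stop early;
            break                              # complex ones keep iterating
    return latents.size(1)                     # number of tokens allocated
```

Under such a rule, a low-entropy image would stop after one rollout (32 tokens) while a cluttered scene would continue toward the 256-token ceiling, consistent with the reported correlation between token count and annotated complexity.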

Implications and Future Work

The implications of this paper are twofold: practical and theoretical. Practically, adaptive token allocation accommodates image-specific compression needs, which is particularly beneficial for tasks requiring fine-grained image processing, such as object/part discovery. Theoretically, adaptively sized visual representations could inform more general vision transformers and ease the integration of modalities in multi-modal AI systems. As the architecture is refined and scaled, it may prove useful for long-horizon video understanding and adaptive visual-abstract reasoning, where static representations fall short.

For future research, training larger instances of ALIT on extensive datasets such as LAION and integrating the architecture with generative or reasoning tasks could uncover additional applications. Extending adaptive tokenization beyond static image datasets to video or streaming data is another promising direction.

Ultimately, this paper provides a robust framework for adaptive image tokenization, pointing toward more efficient and nuanced visual representation learning strategies that better mimic flexible, human-like processing.

References (40)
  1. Adaptive input representations for neural language modeling. CoRR, abs/1809.10853, 2018. URL http://arxiv.org/abs/1809.10853.
  2. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. doi: 10.1109/TPAMI.2013.50.
  3. Flexivit: One model for all patch sizes. arXiv preprint arXiv:2212.08013, 2022.
  4. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023.
  5. Matryoshka multimodal models. arXiv preprint arXiv:2405.17430, 2024.
  6. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021.
  7. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.
  8. François Chollet. On the measure of intelligence, 2019. URL https://arxiv.org/abs/1911.01547.
  9. Universal transformers. ArXiv, abs/1807.03819, 2018. URL https://api.semanticscholar.org/CorpusID:49667762.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020. URL https://arxiv.org/abs/2010.11929.
  11. Taming transformers for high-resolution image synthesis, 2020.
  12. Think before you speak: Training language models with pause tokens, 2024. URL https://arxiv.org/abs/2310.02226.
  13. Alex Graves. Adaptive computation time for recurrent neural networks. ArXiv, abs/1603.08983, 2016. URL https://api.semanticscholar.org/CorpusID:8224916.
  14. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
  15. Thinking tokens for language modeling, 2024. URL https://arxiv.org/abs/2405.08644.
  16. Matryoshka query transformer for large vision-language models, 2024.
  17. Straightening out the straight-through estimator: Overcoming optimization challenges in vector quantized networks, 2023. URL https://arxiv.org/abs/2305.08842.
  18. Marcus Hutter. The hutter prize. http://prize.hutter1.net, 2006.
  19. Scalable adaptive computation for iterative generation, 2022.
  20. Scalable adaptive computation for iterative generation, 2023. URL https://arxiv.org/abs/2212.11972.
  21. Perceiver IO: A general architecture for structured inputs & outputs. CoRR, abs/2107.14795, 2021a. URL https://arxiv.org/abs/2107.14795.
  22. Perceiver: General perception with iterative attention, 2021b. URL https://arxiv.org/abs/2103.03206.
  23. Mixture of nested experts: Adaptive processing of visual tokens, 2024. URL https://arxiv.org/abs/2407.19985.
  24. Auto-encoding variational bayes, 2022. URL https://arxiv.org/abs/1312.6114.
  25. Matryoshka representation learning. In Advances in Neural Information Processing Systems, December 2022.
  26. Universal intelligence: A definition of machine intelligence. CoRR, abs/0712.3329, 2007. URL http://arxiv.org/abs/0712.3329.
  27. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  28. Dynamicvit: Efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  29. Savoias: A diverse, multi-category visual complexity dataset. arXiv preprint arXiv:1810.01771, 2018.
  30. Jürgen Schmidhuber. Low-complexity art. Leonardo, 30(2):97–103, 1996. doi: 10.2307/1576418.
  31. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks, 2021. URL https://arxiv.org/abs/2106.04537.
  32. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, pp.  2443–2449, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380379. doi: 10.1145/3404835.3463257. URL https://doi.org/10.1145/3404835.3463257.
  33. Understanding contrastive representation learning through alignment and uniformity on the hypersphere, 2022. URL https://arxiv.org/abs/2005.10242.
  34. Detecting people in artwork with cnns. CoRR, abs/1610.08871, 2016. URL http://arxiv.org/abs/1610.08871.
  35. Adaptive computation with elastic input sequence, 2023. URL https://arxiv.org/abs/2301.13195.
  36. Elastictok: Adaptive tokenization for image and video. arXiv preprint, 2024.
  37. A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  38. Vector-quantized image modeling with improved VQGAN. CoRR, abs/2110.04627, 2021. URL https://arxiv.org/abs/2110.04627.
  39. An image is worth 32 tokens for reconstruction and generation. arXiv preprint arXiv:2406.07550, 2024.
  40. Scaling the codebook size of VQGAN to 100,000 with a utilization rate of 99%. arXiv preprint arXiv:2406.11837, 2024. URL https://arxiv.org/abs/2406.11837.
Authors (4)
  1. Shivam Duggal
  2. Phillip Isola
  3. Antonio Torralba
  4. William T. Freeman