
Subobject-level Image Tokenization (2402.14327v2)

Published 22 Feb 2024 in cs.CV and cs.CL

Abstract: Transformer-based vision models typically tokenize images into fixed-size square patches as input units, which lacks adaptability to image content and overlooks the inherent pixel grouping structure. Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level, where the subobjects are represented by semantically meaningful image segments obtained by segmentation models (e.g., segment anything models). To implement a learning system based on subobject tokenization, we first introduce a Direct Segment Anything Model (DirectSAM) that efficiently produces comprehensive segmentation of subobjects, then embed subobjects into compact latent vectors and feed them into an LLM for vision-language learning. Empirical results demonstrate that our subobject-level tokenization significantly facilitates efficient learning of translating images into object and attribute descriptions compared to traditional patch-level tokenization. Codes and models are open-sourced at https://github.com/ChenDelong1999/subobjects.


Summary

  • The paper introduces a novel subobject-level tokenization method that segments images into semantically grouped parts, similar to subword tokenization in NLP.
  • The methodology employs a Sequence-to-Sequence AutoEncoder (SeqAE) to compress irregular image segments into robust embedding vectors.
  • Pre-training the SeqAE on SA-1B and integrating subobject embeddings with a large vision-language model significantly boosts learning efficiency and accuracy on the CLEVR benchmark.

Subobject-level Image Tokenization: Enhancing Vision-Language Models

This paper addresses a critical limitation in the current paradigm of transformer-based vision-language models. Traditional methods tokenize images into fixed-size square patches without adapting to image content, thereby ignoring the inherent pixel grouping structure. In response, the authors propose a novel approach to image tokenization at a subobject level, reminiscent of subword tokenization in NLP. The method leverages semantically meaningful image segments, or "subobjects," obtained from segmentation models, to improve efficiency and accuracy in vision-language tasks.
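To make the contrast concrete, here is a minimal sketch (not the paper's code) of the two tokenization regimes: fixed-grid patch tokenization always yields the same number of tokens, while subobject tokenization, driven by a segmentation mask, yields a content-dependent number of variable-length tokens. The function names and sizes below are illustrative assumptions.

```python
import numpy as np

# Patch-level tokenization: slice an image into a fixed grid, regardless of content.
def patch_tokenize(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into (N, patch, patch, C) square patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must be divisible by patch size"
    return (image
            .reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(-1, patch, patch, c))

# Subobject-level tokenization: group pixels by a segmentation mask
# (e.g., produced by SAM/DirectSAM), so the token count adapts to content.
def subobject_tokenize(image: np.ndarray, seg_mask: np.ndarray) -> list[np.ndarray]:
    """Gather the pixels of each segment id into one variable-length token."""
    return [image[seg_mask == sid] for sid in np.unique(seg_mask)]

image = np.zeros((64, 64, 3), dtype=np.uint8)
seg_mask = np.zeros((64, 64), dtype=np.int32)  # 0 = background
seg_mask[10:30, 10:30] = 1                     # one "subobject"
seg_mask[40:60, 5:25] = 2                      # another

print(len(patch_tokenize(image)))                 # 16 fixed patches
print(len(subobject_tokenize(image, seg_mask)))   # 3 content-dependent tokens
```

The point of the sketch is the interface difference: patch tokens are uniform arrays, whereas subobject tokens have irregular sizes, which is exactly what motivates the SeqAE described next.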

Key Innovations

  • Subobject-Level Tokenization: The central proposition is to tokenize images into subobjects, akin to subword tokenization in text, thus bridging the gap between pixel-level and object-level representations. This approach is informed by advances in image segmentation, particularly models like the Segment Anything Model (SAM). Subobject tokenization addresses inefficiencies in prevalent patch-based methods, which are analogous to inefficient character-level tokenization in NLP.
  • Sequence-to-Sequence AutoEncoder (SeqAE): To facilitate the transformation of subobject segments into compact representations, the authors introduce SeqAE. This model compresses segments of varying shapes into embedding vectors, maintaining a rich representation of visual data without unnecessary downsampling. The SeqAE framework enables the handling of irregular segment sizes more efficiently than conventional techniques.
  • Large Vision-Language Model (LVLM) Integration: The paper describes an LVLM architecture that incorporates these subobject embeddings into an LLM. The subobject tokens are treated similarly to textual subword tokens, with additional positional embeddings to account for their two-dimensional nature.
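The two embedding steps above can be sketched together. This is a simplified stand-in, not the paper's implementation: the real SeqAE is a learned sequence-to-sequence autoencoder with attention over the pixel sequence, whereas here a single random projection merely illustrates the "variable-length pixel sequence in, fixed-size vector out" interface, and the 2D positional embedding of the segment centroid is an assumed sinusoidal form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the learned SeqAE encoder weights.
D_PIX, MAX_LEN, D_EMB = 3, 256, 64
W_enc = rng.normal(size=(MAX_LEN * D_PIX, D_EMB)) / np.sqrt(MAX_LEN * D_PIX)

def encode_segment(pixels: np.ndarray) -> np.ndarray:
    """Compress a variable-length (N, 3) pixel sequence into a fixed D_EMB vector."""
    seq = np.zeros((MAX_LEN, D_PIX))
    n = min(len(pixels), MAX_LEN)
    seq[:n] = pixels[:n]                 # pad (or truncate) to a fixed length
    return seq.reshape(-1) @ W_enc       # single projection in place of the SeqAE

def pos_embed_2d(cy: float, cx: float, d: int = D_EMB) -> np.ndarray:
    """Sinusoidal embedding of a segment's (row, col) centroid, normalized to [0, 1]."""
    freqs = 1.0 / (10000 ** (np.arange(d // 4) / (d // 4)))
    return np.concatenate([np.sin(cy * freqs), np.cos(cy * freqs),
                           np.sin(cx * freqs), np.cos(cx * freqs)])

# One subobject token = content embedding + 2D positional embedding,
# ready to be interleaved with subword token embeddings in the LLM input.
pixels = rng.random((120, D_PIX))        # 120 pixels in this segment
token = encode_segment(pixels) + pos_embed_2d(0.3, 0.7)
print(token.shape)  # (64,)
```

The design point this illustrates is that, once each segment is a fixed-size vector with an explicit 2D position, subobject tokens and subword tokens share the same interface and can be mixed freely in one transformer sequence.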

Empirical Results

The authors substantiate their claims through empirical evaluations. The SeqAE model is pre-trained on the SA-1B dataset to obtain robust subobject embeddings, while the LVLM is assessed on CLEVR for image captioning tasks. The results are compelling: subobject-level tokenization significantly expedites learning and improves the model's accuracy in identifying object attributes and counts.

Implications and Future Prospects

From a practical standpoint, subobject-level tokenization presents an opportunity to significantly enhance the efficiency of vision-language models. It aligns with the increasing demand for systems that can process visual information with semantic understanding, a crucial aspect of intelligent systems. Theoretically, it opens new research avenues in tokenization strategies that consider the semantic granularity of inputs, potentially applicable beyond vision tasks. Future work may expand the subobject tokenization approach to other domains and integrate it with increasingly capable LLMs to achieve higher levels of contextual understanding and generation in multimodal AI systems.

The paper offers a scholarly contribution to the vision-language modeling domain by challenging entrenched methodologies and presenting a robust alternative that aligns well with contemporary LLM practices. This work adds a significant layer of interpretability and efficiency, potentially setting a new standard for how image data is structured and processed in advanced AI systems.
