Multi-modal Auto-regressive Modeling via Visual Words (2403.07720v2)

Published 12 Mar 2024 in cs.CV and cs.AI

Abstract: LLMs, benefiting from auto-regressive modeling performed on massive unannotated text corpora, demonstrate powerful perceptual and reasoning capabilities. However, extending auto-regressive modeling to multi-modal scenarios to build Large Multi-modal Models (LMMs) faces a major difficulty: image information is processed in the LMM as continuous visual embeddings, which cannot provide the discrete supervised labels needed for classification. In this paper, we successfully perform multi-modal auto-regressive modeling with a unified objective for the first time. Specifically, we propose the concept of visual words, which maps visual features to probability distributions over the LLM's vocabulary, providing supervision information for visual modeling. We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information. Experimental results and ablation studies on 5 VQA tasks and 4 benchmark toolkits validate the powerful performance of our proposed approach.
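
The sketch below is a rough illustration of the "visual words" idea described in the abstract, not the authors' implementation: the toy tensor shapes, the use of the LLM's text embedding matrix for the projection, and the KL-divergence soft-label loss are all assumptions introduced only to make the mechanism concrete.

```python
import torch
import torch.nn.functional as F

# Toy sizes (assumptions): B visual positions, feature dim d, LLM vocabulary size V.
# Real LMMs would use d in the thousands and V around 32k.
B, d, V = 4, 64, 1000

visual_features = torch.randn(B, d)    # stand-in for continuous visual embeddings from a vision encoder
vocab_embeddings = torch.randn(V, d)   # stand-in for the LLM's text (vocabulary) embedding matrix

# "Visual words": score each visual feature against every vocabulary embedding and
# normalize into a probability distribution over the LLM's vocabulary.
logits = visual_features @ vocab_embeddings.T   # (B, V)
visual_words = F.softmax(logits, dim=-1)        # soft labels over the vocabulary

# These distributions can then act as discrete-style supervision for the LMM's
# auto-regressive predictions at visual positions. Here a soft-label KL loss is
# used as one plausible choice of objective.
predicted_logits = torch.randn(B, V, requires_grad=True)  # stand-in for the LMM head's outputs
loss = F.kl_div(F.log_softmax(predicted_logits, dim=-1), visual_words, reduction="batchmean")
loss.backward()
```

In this reading, the projection onto the vocabulary is what turns continuous visual embeddings into classification-style targets, which is the gap the abstract says plain auto-regressive modeling cannot bridge on its own.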

Authors (6)
  1. Tianshuo Peng (10 papers)
  2. Zuchao Li (76 papers)
  3. Lefei Zhang (64 papers)
  4. Hai Zhao (227 papers)
  5. Ping Wang (288 papers)
  6. Bo Du (263 papers)
Citations (3)