PALO: A Polyglot Large Multimodal Model for 5B People (2402.14818v2)
Abstract: In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages: English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline that adapts the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable through minimal manual effort. Incorporating these diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and show substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
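The semi-automated pipeline described in the abstract can be pictured as a loop over the English instruction data: a fine-tuned translation LLM produces each target-language version, and only low-confidence outputs are escalated for manual correction. The sketch below is illustrative only; the `translate_with_llm` callable, the confidence scores, and the review threshold are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a semi-automated translation pipeline in the spirit of
# PALO's data preparation: translate English instruction-tuning samples into the
# target languages with a fine-tuned translation LLM, and route low-confidence
# outputs to manual review so human effort stays minimal.
from dataclasses import dataclass
from typing import Callable, List, Tuple

TARGET_LANGUAGES = [
    "Chinese", "Hindi", "Spanish", "French", "Arabic",
    "Bengali", "Russian", "Urdu", "Japanese",
]

@dataclass
class InstructionSample:
    image_id: str       # reference to the associated image (unchanged by translation)
    instruction: str    # instruction text
    response: str       # response text

def translate_dataset(
    samples: List[InstructionSample],
    translate_with_llm: Callable[[str, str], Tuple[str, float]],
    review_threshold: float = 0.8,
) -> Tuple[dict, list]:
    """Translate every English sample into every target language.

    `translate_with_llm(text, language)` is assumed to return the translated
    text plus a confidence score; samples below `review_threshold` are
    collected for human post-editing (the "semi-automated" part).
    """
    translated = {lang: [] for lang in TARGET_LANGUAGES}
    needs_review = []
    for sample in samples:
        for lang in TARGET_LANGUAGES:
            instr, c1 = translate_with_llm(sample.instruction, lang)
            resp, c2 = translate_with_llm(sample.response, lang)
            item = InstructionSample(sample.image_id, instr, resp)
            translated[lang].append(item)
            if min(c1, c2) < review_threshold:
                needs_review.append((lang, item))
    return translated, needs_review
```

In such a setup, the review queue would be the only place requiring human intervention, which is what keeps the approach scalable to new languages.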