Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks (2403.01031v2)
Abstract: Multimodal LLMs (MLLMs) have proven effective in a wide range of tasks requiring complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, the success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, even those with large speaker populations, such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed *Peacock*, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce *Henna*, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, laying the first stone for culturally aware Arabic MLLMs. The GitHub repository for the *Peacock* project is available at https://github.com/UBC-NLP/peacock.
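As a rough illustration of how a model from such a family might be queried for Arabic visual question answering, here is a minimal sketch. It assumes the released checkpoints expose a Hugging Face InstructBLIP-style interface; the model ID, image path, and prompt below are placeholders, not taken from the paper, and the actual loading instructions live in the linked repository.

```python
# Minimal sketch: Arabic visual question answering with an MLLM.
# Assumptions (not from the abstract): checkpoints follow the Hugging Face
# InstructBLIP interface; the model ID is hypothetical -- see
# https://github.com/UBC-NLP/peacock for the actual released weights.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

MODEL_ID = "UBC-NLP/peacock-instructblip"  # hypothetical checkpoint name

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = InstructBlipProcessor.from_pretrained(MODEL_ID)
model = InstructBlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device)

# Load an image and ask a question about it in Arabic.
image = Image.open("photo.jpg").convert("RGB")
prompt = "ما الذي يظهر في هذه الصورة؟"  # "What is shown in this image?"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```

The same pattern would apply to culturally grounded benchmarks such as *Henna*: pair each image with an Arabic question, generate an answer, and score it against the reference.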
- Fakhraddin Alwajih
- El Moatez Billah Nagoudi
- Gagan Bhatia
- Abdelrahman Mohamed
- Muhammad Abdul-Mageed