LaVy: Vietnamese Multimodal Large Language Model (2404.07922v6)

Published 11 Apr 2024 in cs.CL, cs.CV, and cs.LG

Abstract: LLMs and Multimodal LLMs (MLLMs) have taken the world by storm with impressive abilities in complex reasoning and linguistic comprehension. While there is a plethora of work on Vietnamese LLMs, the lack of high-quality multimodal resources limits the progress of Vietnamese MLLMs. In this paper, we pioneer in addressing this gap by introducing LaVy, a state-of-the-art Vietnamese MLLM, and LaVy-Bench, a benchmark designed for evaluating MLLMs' understanding of Vietnamese visual language tasks. Our project is public at https://github.com/baochi0212/LaVy
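The abstract does not spell out LaVy's architecture or inference interface, so the snippet below is only an illustrative sketch of what querying a LLaVA-style MLLM with a Vietnamese visual question looks like in practice; the Hugging Face checkpoint, image URL, prompt template, and question are stand-ins of my choosing, not LaVy's released artifacts (see the project repository for the actual model and the LaVy-Bench evaluation data).

```python
# Hypothetical sketch: asking a LLaVA-style multimodal model a Vietnamese question
# about an image. The checkpoint and URL are illustrative stand-ins, not LaVy's
# official release (see https://github.com/baochi0212/LaVy for the real project).
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in checkpoint, not LaVy
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Load an arbitrary test image (placeholder URL).
image = Image.open(requests.get("https://example.com/street.jpg", stream=True).raw)

question = "Trong bức ảnh này có gì?"  # "What is in this picture?"
prompt = f"USER: <image>\n{question} ASSISTANT:"  # LLaVA-1.5 chat format

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

A benchmark harness in the spirit of LaVy-Bench would loop such a call over image-question pairs and score the generated Vietnamese answers against references.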

Authors (2)
  1. Chi Tran (6 papers)
  2. Huong Le Thanh (1 paper)
Citations (3)