Komodo: A Linguistic Expedition into Indonesia's Regional Languages (2403.09362v2)

Published 14 Mar 2024 in cs.CL

Abstract: The recent breakthroughs in LLMs have mostly focused on languages with easily available and sufficient resources, such as English. However, there remains a significant gap for languages that lack sufficient linguistic resources in the public domain. Our work introduces Komodo-7B, a family of 7-billion-parameter LLMs designed to address this gap by seamlessly operating across Indonesian, English, and 11 regional languages in Indonesia. The family consists of Komodo-7B-Base and Komodo-7B-Instruct. Komodo-7B-Instruct stands out by achieving state-of-the-art performance across various tasks and languages, outperforming the benchmarks set by OpenAI's GPT-3.5, Cohere's Aya-101, Llama-2-Chat-13B, Mixtral-8x7B-Instruct-v0.1, Gemma-7B-it, and many more. The model not only demonstrates superior performance in both language-specific and overall assessments but also excels in linguistic diversity. Our commitment to advancing LLMs extends beyond well-resourced languages, aiming to bridge the gap for those with limited linguistic assets. Additionally, Komodo-7B-Instruct's improved cross-language understanding helps address educational disparities in Indonesia by offering direct translations from English to 11 regional languages, a significant improvement over existing translation services. Komodo-7B represents a crucial step towards inclusivity and effectiveness in LLMs, catering to the linguistic needs of diverse communities.
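Since Komodo-7B is described as a standard 7B causal language model, querying it for the English-to-regional-language translation the abstract highlights would look roughly like the sketch below. This is a minimal illustration, not the authors' documented interface: the Hugging Face repository id "Yellow-AI-NLP/komodo-7b-base" and the plain-text prompt format are assumptions not confirmed by this page.

```python
# Minimal sketch: prompting Komodo-7B for English -> regional-language translation.
# Assumptions (not confirmed by this page): the checkpoint id below and the
# plain-text prompt format; substitute the real ones if they differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yellow-AI-NLP/komodo-7b-base"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so a 7B model fits on one GPU
    device_map="auto",
)

prompt = (
    "Translate the following sentence from English to Javanese:\n"
    "Good morning, how are you?\n"
    "Translation:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```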

Authors (4)
  1. Louis Owen (5 papers)
  2. Vishesh Tripathi (4 papers)
  3. Abhay Kumar (28 papers)
  4. Biddwan Ahmed (5 papers)
Citations (4)
