GECKO: Generative Language Model for English, Code and Korean (2405.15640v1)
Abstract: We introduce GECKO, a bilingual large language model (LLM) optimized for Korean and English, along with programming languages. GECKO is pretrained on a balanced, high-quality corpus of Korean and English using the LLaMA architecture. In this report, we share our experiences in building a better data pipeline for the corpus and in training our model. GECKO shows strong efficiency in token generation for both Korean and English despite its small vocabulary size. We measure performance on representative Korean, English, and code benchmarks: GECKO performs strongly on KMMLU (Korean MMLU) and modestly on English and code, even though it was trained on fewer tokens than English-focused LLMs. GECKO is released to the open-source community under a permissive license. We hope our work offers a research baseline and practical insights for Korean LLM research. The model can be found at: https://huggingface.co/kifai/GECKO-7B
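Since the checkpoint is published on the Hugging Face Hub, a minimal sketch of loading and prompting it is shown below, assuming the released model follows the standard `transformers` causal-LM interface. The repository id comes from the link above; the dtype and generation settings are illustrative assumptions, not values reported by the authors.

```python
# Minimal sketch: loading GECKO-7B with Hugging Face transformers.
# Assumes a standard causal-LM checkpoint; settings below are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kifai/GECKO-7B"  # repository linked in the abstract

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    device_map="auto",
)

# Korean prompt: "Briefly explain the strengths of the Korean language model GECKO."
prompt = "한국어 언어 모델 GECKO의 장점을 간단히 설명해줘."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```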
Authors: Sungwoo Oh, Donggyu Kim