HyperCLOVA X Technical Report (2404.01954v2)
Abstract: We introduce HyperCLOVA X, a family of LLMs tailored to the Korean language and culture, with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, then instruction-tuned with high-quality human-annotated datasets under strict safety guidelines that reflect our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of its inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries developing their own sovereign LLMs.
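The cross-lingual inference evaluation mentioned above can be illustrated with a simple zero-shot prompting setup. The sketch below is only a rough approximation, not the report's evaluation harness: it assumes a generic Hugging Face causal-LM checkpoint (the model name is a placeholder, not an official HyperCLOVA X release) and probes a single Korean-English premise-hypothesis pair in the spirit of XNLI.

```python
# Minimal sketch of zero-shot cross-lingual NLI prompting (XNLI-style).
# Not the authors' harness; the checkpoint name below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/your-bilingual-llm"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def nli_label(premise: str, hypothesis: str) -> str:
    """Ask the model for entailment / neutral / contradiction on one pair."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        "Question: Does the premise entail the hypothesis? "
        "Answer with one word: entailment, neutral, or contradiction.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    # Decode only the newly generated tokens, dropping the prompt.
    answer = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(answer, skip_special_tokens=True).strip()

# Korean premise with an English hypothesis: a cross-lingual pair.
print(nli_label("그 회의는 오전 9시에 시작했다.", "The meeting started in the morning."))
```

A benchmark-scale evaluation would iterate this over the full XNLI test split and score the one-word answers against the gold labels; the report additionally covers machine translation between several language pairs.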
Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu