HyperCLOVA X Technical Report (2404.01954v2)

Published 2 Apr 2024 in cs.CL and cs.AI

Abstract: We introduce HyperCLOVA X, a family of LLMs tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

Summary

  • The paper describes a pretraining recipe built on a balanced Korean-English-code data mix and an architecture using pre-normalization, grouped-query attention, and rotary position embeddings to support strong bilingual performance.
  • The paper demonstrates superior benchmark results, excelling in both Korean language tasks and cross-lingual inference challenges.
  • The paper emphasizes rigorous safety and ethical practices, incorporating extensive evaluations and dedicated ethics principles to mitigate biases.

HyperCLOVA X: Advancing Korean-centric LLMs with Multilingual Capabilities

Training Details

HyperCLOVA X comprises two models, HCX-L and HCX-S, and represents a significant advance in LLMs focused on the Korean language and culture. Training begins with a balanced mix of Korean, English, and code data. Architecturally, the models adopt pre-normalization and grouped-query attention alongside rotary position embeddings, improving training stability and the handling of long inputs. The pretraining corpus was carefully curated to represent high-quality, diverse content while excluding low-quality, repetitive, and sensitive material. This curation refines the training data and contributes directly to the model's ability to understand and generate both Korean and English.
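
The report itself contains no code; to make these architectural terms concrete, below is a minimal, self-contained PyTorch sketch of a pre-norm transformer block combining grouped-query attention with rotary position embeddings. All dimensions, module names, and the use of LayerNorm are illustrative assumptions, not the HyperCLOVA X implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim). Rotary position embedding rotates
    # channel pairs (i, i + d/2) by a position-dependent angle.
    _, _, s, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, device=x.device) / half)
    angles = torch.arange(s, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GQABlock(nn.Module):
    """Pre-norm transformer block with grouped-query attention:
    many query heads share a smaller set of key/value heads."""
    def __init__(self, dim=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.hd = dim // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        # Pre-normalization: normalize *before* each sublayer. LayerNorm is
        # a stand-in here; LLaMA-style models typically use RMSNorm.
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.wq = nn.Linear(dim, n_q_heads * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        b, s, _ = x.shape
        h = self.norm1(x)
        q = self.wq(h).view(b, s, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(h).view(b, s, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(h).view(b, s, self.n_kv, self.hd).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # Each group of query heads attends to one shared key/value head.
        rep = self.n_q // self.n_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, -1)
        x = x + self.wo(attn)                  # residual after attention
        return x + self.mlp(self.norm2(x))     # pre-norm MLP with residual
```

With eight query heads sharing two key/value heads, the key/value cache at inference is a quarter the size of standard multi-head attention, which is the practical motivation for grouped-query attention.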

Benchmark Performance

HyperCLOVA X performs strongly across benchmarks of reasoning, knowledge, and language understanding. Its results on comprehensive Korean benchmarks reflect a deep grasp of Korean cultural and societal nuances. Compared with both Korean-focused and general-purpose foundation models, HyperCLOVA X shows a clear advantage, particularly on tasks that demand nuanced understanding and knowledge application. Its performance on core English-language benchmarks further confirms its bilingual capability, facilitating cross-cultural exchange and understanding.
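
Multiple-choice benchmarks of this kind are commonly scored by comparing the log-likelihood the model assigns to each answer option given the question, and picking the highest-scoring option. The sketch below shows that standard procedure with assumed `model` and `tokenizer` interfaces; it is illustrative, not the paper's actual evaluation harness.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choice_logprob(model, tokenizer, prompt, choice):
    """Sum log-probabilities of the choice tokens given the prompt.
    `model` is an assumed causal LM returning logits (1, seq, vocab);
    `tokenizer` is an assumed callable mapping text to a list of ids."""
    prompt_ids, choice_ids = tokenizer(prompt), tokenizer(choice)
    ids = torch.tensor([prompt_ids + choice_ids])
    logprobs = F.log_softmax(model(ids), dim=-1)
    total = 0.0
    # Each choice token is predicted from the previous position's logits.
    for i, tok in enumerate(choice_ids):
        total += logprobs[0, len(prompt_ids) + i - 1, tok].item()
    return total

def eval_multiple_choice(model, tokenizer, dataset):
    """Accuracy over items like {'question': str, 'choices': [str], 'answer': int}."""
    correct = 0
    for item in dataset:
        scores = [choice_logprob(model, tokenizer, item["question"], c)
                  for c in item["choices"]]
        best = max(range(len(scores)), key=scores.__getitem__)
        correct += int(best == item["answer"])
    return correct / len(dataset)
```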

Multilingual Abilities

The models' bilingual design extends to broader multilingualism, demonstrated through machine translation and cross-lingual inference tasks. HyperCLOVA X achieves state-of-the-art machine translation performance between Korean and other languages widely used in Korea, including Japanese and Chinese. This capability matters in settings that demand fluency across multiple languages, from academic research to global communication.
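
Machine translation quality in such comparisons is usually reported with automatic corpus-level metrics. The snippet below scores invented example outputs with the sacrebleu library's BLEU and chrF; the texts are hypothetical, and the report's own metric choices may differ.

```python
import sacrebleu

# Hypothetical system outputs and one reference translation per segment
# (e.g., for a Korean -> English test set).
hypotheses = [
    "The weather in Seoul is clear today.",
    "Please send the report by Friday.",
]
# sacrebleu expects a list of reference streams, each aligned to hypotheses.
references = [[
    "The weather is clear in Seoul today.",
    "Send me the report by Friday, please.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")
```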

Safety and Ethical Considerations

The development of HyperCLOVA X adheres strictly to responsible AI practices. Through extensive safety evaluations and the establishment of the HyperCLOVA X Ethics Principles, the model is built to generate content that is accurate, safe, and free from harmful biases and toxic outputs. This proactive approach to AI safety includes red teaming exercises and feedback mechanisms that continually refine the model's alignment with ethical standards.
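
Feedback-driven alignment of this kind is commonly implemented by training a reward model on pairwise human preferences with a Bradley-Terry style loss, then optimizing the LLM against that reward. The sketch below shows the loss with an assumed `reward_model` interface; it illustrates the general technique, not the report's specific pipeline.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: push the reward of the human-preferred
    response above the rejected one. `reward_model` is an assumed module
    mapping token ids (batch, seq) to one scalar reward per sequence."""
    r_chosen = reward_model(chosen_ids)      # (batch,)
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # -log sigmoid(r_c - r_r) is minimized when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```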

Conclusion and Future Directions

HyperCLOVA X sets a new standard for LLMs in its proficiency with the Korean language, its grasp of cultural nuances, and its multilingual reach. Going forward, multimodality and model quantization are stated priorities, aimed at further improving the model's utility and accessibility. The development trajectory of HyperCLOVA X reflects a commitment to harnessing AI responsibly, fostering technological advances that are inclusive, safe, and beneficial across diverse linguistic and cultural landscapes.
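
As a pointer to what the quantization direction involves, here is a minimal sketch of symmetric per-channel int8 weight quantization, one common starting point. It illustrates the general idea only; the report does not commit to a specific scheme.

```python
import torch

def quantize_int8(weight):
    """Symmetric per-output-channel int8 quantization of a weight matrix.
    Returns the int8 tensor and per-channel scales for dequantization."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, s = quantize_int8(w)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```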
