HyperCLOVA X Technical Report

(2404.01954)
Published Apr 2, 2024 in cs.CL and cs.AI

Abstract

We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.
HyperCLOVA X outperforms Korean models; rivals LLaMA 2 in English on diverse benchmarks.

Overview

  • HyperCLOVA X is a family of Korean-centric LLMs comprising a larger model, HCX-L, and a smaller model, HCX-S, both trained on a balanced mix of Korean, English, and code data.

  • The architecture adopts established techniques, namely pre-normalization, grouped-query attention, and rotary position embeddings, which are intended to improve training stability and support longer contexts.

  • The models perform strongly on Korean and English benchmarks, and the report also evaluates machine translation and cross-lingual tasks involving other languages such as Japanese and Chinese.

  • Development follows strict safety guidelines and responsible AI practices aimed at reducing harmful, toxic, or biased outputs.

Training Details

HyperCLOVA X comprises the HCX-L and HCX-S models, both focused on the Korean language and culture. Pretraining starts from a balanced mix of Korean, English, and code data. Architecturally, the models adopt pre-normalization, grouped-query attention, and rotary position embeddings, choices aimed at stable training and better handling of long sequences. The pretraining corpus is curated to keep high-quality, diverse content and to exclude low-quality, repetitive, or sensitive material. Instruction-tuning on human-annotated datasets follows pretraining. Together, these steps are credited with the models' ability to understand and generate content in both Korean and English.
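To make two of the architectural terms above concrete, here is a minimal pure-Python sketch of rotary position embeddings and causal grouped-query attention. It is an illustration of the general techniques only, not the HyperCLOVA X implementation: the function names (`rope`, `gqa_attention`), the rotation base of 10000, the toy dimensions, and the nested-list tensor layout are all assumptions made for this example, and pre-normalization is omitted for brevity.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotary position embedding: rotate each consecutive pair of
    components by an angle that grows with the token position."""
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gqa_attention(queries, keys, values):
    """Causal grouped-query attention.

    queries:      [n_q_heads][seq_len][head_dim]
    keys, values: [n_kv_heads][seq_len][head_dim]
    Each group of n_q_heads // n_kv_heads query heads shares one key/value head.
    """
    n_q, n_kv = len(queries), len(keys)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    outputs = []
    for h in range(n_q):
        kv = h // group  # index of the shared key/value head
        seq_len, d = len(queries[h]), len(queries[h][0])
        q_rot = [rope(queries[h][t], t) for t in range(seq_len)]
        k_rot = [rope(keys[kv][t], t) for t in range(seq_len)]
        head_out = []
        for t in range(seq_len):
            # Causal mask: position t attends only to positions 0..t.
            scores = [dot(q_rot[t], k_rot[s]) / math.sqrt(d) for s in range(t + 1)]
            weights = softmax(scores)
            head_out.append([
                sum(weights[s] * values[kv][s][j] for s in range(t + 1))
                for j in range(d)
            ])
        outputs.append(head_out)
    return outputs

# Toy usage: 4 query heads share 2 key/value heads (hypothetical sizes).
queries = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
keys = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
values = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = gqa_attention(queries, keys, values)
```

Two properties are worth noting in this sketch. Because the rotation is applied to both queries and keys, their dot product depends only on the distance between the two positions, which is one reason rotary embeddings are commonly associated with better length handling. Sharing key/value heads across query heads reduces the size of the key/value cache at inference time relative to full multi-head attention.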

Benchmark Performance

HyperCLOVA X is evaluated on benchmarks covering reasoning, knowledge, commonsense, factuality, coding, math, chat, instruction-following, and harmlessness, in both Korean and English. Its results on comprehensive Korean benchmarks indicate a strong grasp of Korean cultural and societal nuances. Compared with other Korean-focused models and with general-purpose foundation models, it performs better on Korean tasks, particularly those requiring nuanced understanding and applied knowledge. Its results on core English-language benchmarks are competitive with models such as LLaMA 2, supporting its claim to bilingual capability.

Multilingual Abilities

The report also examines how the models' bilingual design extends to multilingual use, measured through machine translation and cross-lingual inference tasks. It presents state-of-the-art machine translation results between Korean and other languages widely used in Korea, including Japanese and Chinese, and shows generalization to languages that were not explicitly targeted during training. This matters for settings that require fluency across several languages, from academic research to international communication.

Safety and Ethical Considerations

HyperCLOVA X is developed under strict safety guidelines grounded in the HyperCLOVA X Ethics Principles. The goal is content that is accurate, safe, and free of harmful bias and toxic output. The safety process includes extensive safety evaluations, red teaming exercises, and feedback mechanisms used to keep refining the model's alignment with these principles.

Conclusion and Future Directions

HyperCLOVA X combines strong proficiency in Korean, an understanding of Korean cultural nuances, and multilingual capability that extends beyond its two primary languages. Planned future work includes multimodality and model quantization, with the aim of improving the model's utility and accessibility. The authors also position HyperCLOVA X as a reference point for regions or countries developing their own sovereign LLMs, with responsible and safe development as a stated priority.

