YAYI 2: Multilingual Open-Source Large Language Models (2312.14862v1)

Published 22 Dec 2023 in cs.CL and cs.AI

Abstract: As one of the latest advancements in natural language processing, LLMs have achieved human-level language understanding and generation abilities in many real-world tasks, and have even been regarded as a potential path to artificial general intelligence. To better facilitate research on LLMs, many open-source LLMs, such as Llama 2 and Falcon, have recently been proposed and have achieved performance comparable to proprietary models. However, these models are primarily designed for English scenarios and exhibit poor performance in Chinese contexts. In this technical report, we propose YAYI 2, including both base and chat models, with 30 billion parameters. YAYI 2 is pre-trained from scratch on a multilingual corpus containing 2.65 trillion tokens filtered by our pre-training data processing pipeline. The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback. Extensive experiments on multiple benchmarks, such as MMLU and CMMLU, consistently demonstrate that the proposed YAYI 2 outperforms other similarly sized open-source models.

Introduction

LLMs stand at the forefront of AI, demonstrating human-like proficiency in understanding and generating language. These models serve a range of purposes, from aiding in creative writing to summarizing extensive texts and planning activities, and even have the potential to pave the way towards artificial general intelligence (AGI). However, LLMs typically require vast amounts of data and extensive computing resources. While proprietary models like ChatGPT have made headlines, there's an ongoing effort to create open-source alternatives that could democratize access to this powerful technology. One significant limitation of existing LLMs is their focus on English, leaving a gap in performance for other languages such as Chinese.

Pre-Training

YAYI 2 comprises base and chat models with 30 billion parameters each, pre-trained from scratch on a multilingual corpus that particularly improves performance in Chinese contexts. The developers curated a massive dataset of over 240 terabytes of raw text, 41.5% of which is Chinese, drawn from diverse sources such as news and Wikipedia. A rigorous processing pipeline, combining normalization, heuristic cleaning, multi-level deduplication, and toxicity filtering, was designed to ensure data quality and safe model outputs. Techniques such as FlashAttention 2 and multi-query attention were adopted to speed up training and inference; the latter is illustrated in the sketch below.
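
Multi-query attention, one of the efficiency techniques mentioned above, lets all query heads share a single key/value head, which shrinks the key/value cache and speeds up decoding. The NumPy sketch below only illustrates that idea under assumed shapes and weight layouts; it is not the paper's implementation.

```python
# Minimal sketch of multi-query attention: every query head attends over one
# shared key/value head. Shapes, weight layout, and names are illustrative
# assumptions, not the YAYI 2 implementation.
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); w_q: (d_model, n_heads * d_head);
    w_k, w_v: (d_model, d_head) -- a single shared K/V head;
    w_o: (n_heads * d_head, d_model)."""
    seq_len, _ = x.shape
    d_head = w_k.shape[1]

    q = (x @ w_q).reshape(seq_len, n_heads, d_head)  # per-head queries
    k = x @ w_k                                      # shared keys   (seq_len, d_head)
    v = x @ w_v                                      # shared values (seq_len, d_head)

    # Scaled dot-product attention with a causal mask, K/V shared across heads.
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    causal = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    scores = scores + causal
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum("hqk,kd->qhd", weights, v).reshape(seq_len, n_heads * d_head)
    return out @ w_o

# Toy usage: 8 query heads attending over one shared K/V head.
rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 64, 8, 8, 16
x = rng.standard_normal((seq_len, d_model))
out = multi_query_attention(
    x,
    rng.standard_normal((d_model, n_heads * d_head)),
    rng.standard_normal((d_model, d_head)),
    rng.standard_normal((d_model, d_head)),
    rng.standard_normal((n_heads * d_head, d_model)),
    n_heads,
)
print(out.shape)  # (16, 64)
```

Because the key/value projections are shared, the cache stored during generation is a factor of n_heads smaller than in standard multi-head attention, which is where the inference speedup comes from.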

Alignment

The base model was aligned through supervised fine-tuning on millions of instruction-output pairs, followed by reinforcement learning from human feedback (RLHF). This process was crucial for handling long instructions and multi-turn conversations. The instruction data covered a wide array of tasks and was screened along several quality dimensions, with an emphasis on balance and high quality. The model was also tuned on various domain-specific tasks to support its effectiveness in real-world business scenarios. The sketch below illustrates the supervised fine-tuning objective.
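
To make the supervised fine-tuning step concrete, the following sketch computes the standard instruction-tuning objective: next-token cross-entropy over the response tokens only, with the instruction tokens masked out of the loss. This reflects common practice rather than the YAYI 2 training code; the shapes, toy data, and masking convention are assumptions.

```python
# Generic sketch of the supervised fine-tuning (instruction-tuning) loss:
# average next-token negative log-likelihood over response positions only.
# Model, tokenizer, and shapes are assumed for illustration.
import numpy as np

def sft_loss(logits, target_ids, loss_mask):
    """logits: (seq_len, vocab); target_ids: (seq_len,) next-token targets;
    loss_mask: (seq_len,) 1.0 on response positions, 0.0 on prompt positions."""
    # Log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(target_ids)), target_ids]
    # Average only over the response tokens.
    return (token_nll * loss_mask).sum() / loss_mask.sum()

# Toy example: a 6-token sequence whose first 3 tokens are the instruction.
rng = np.random.default_rng(0)
vocab, seq_len = 100, 6
logits = rng.standard_normal((seq_len, vocab))
targets = rng.integers(0, vocab, size=seq_len)
mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(sft_loss(logits, targets, mask))
```

The subsequent RLHF stage then optimizes a learned reward signal on top of this fine-tuned model, which is what the report credits for better handling of human preferences.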

Evaluations

Benchmarking shows that the YAYI 2 base model outperforms several similarly sized open-source models on standard benchmarks covering knowledge and language understanding, mathematical reasoning, and programming. Its results on multilingual benchmarks and on tasks requiring contextually relevant understanding are particularly noteworthy. While YAYI 2 demonstrates strong language understanding and generation capabilities, users are cautioned to review its outputs, especially in sensitive scenarios, to avoid propagating potentially harmful content. A sketch of the multiple-choice scoring procedure typically used by such benchmarks follows.
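
For context, multiple-choice benchmarks such as MMLU and CMMLU are commonly scored by asking the model for the log-likelihood of each candidate answer and picking the highest-scoring option. The sketch below shows that generic procedure; the `log_likelihood` callable is a hypothetical stand-in for a real model call, not an API from the report.

```python
# Generic sketch of multiple-choice benchmark scoring (MMLU/CMMLU style):
# score each candidate answer by the model's log-likelihood of the answer
# continuation and take the argmax. `log_likelihood` is a placeholder.
from typing import Callable, Sequence

def score_multiple_choice(
    questions: Sequence[dict],
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Each question dict: {"prompt": str, "choices": [str, ...], "answer": int}.
    Returns accuracy over the question set."""
    correct = 0
    for q in questions:
        scores = [log_likelihood(q["prompt"], choice) for choice in q["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == q["answer"])
    return correct / len(questions)

# Toy usage with a dummy scorer that simply prefers longer answers.
dummy = lambda prompt, choice: float(len(choice))
qs = [{"prompt": "2 + 2 = ?", "choices": ["3", "four", "5"], "answer": 1}]
print(score_multiple_choice(qs, dummy))  # 1.0 with this toy scorer
```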

In conclusion, YAYI 2 is a multilingual, open-source LLM that offers significant advances over its open-source counterparts, especially in Chinese-language contexts. The model combines efficiency-oriented training techniques (such as FlashAttention 2 and multi-query attention) with alignment via supervised fine-tuning and RLHF, and it performs impressively on benchmarks that test a variety of capabilities considered essential to AGI.

References (51)
  1. 01-AI. 2023. Yi: A series of large language models trained from scratch by developers at 01-ai. https://github.com/01-ai/Yi.
  2. Falcon-40B: An open large language model with state-of-the-art performance. https://huggingface.co/tiiuae/falcon-40b.
  3. Palm 2 technical report.
  4. Program synthesis with large language models.
  5. Layer normalization.
  6. BAAI. 2023. Aquila2 series proposed by BAAI. https://github.com/FlagAI-Open/Aquila2.
  7. Qwen technical report.
  8. Constitutional AI: Harmlessness from AI feedback.
  9. Baichuan. 2023. A large-scale 7B pretraining language model developed by baichuan Inc. https://github.com/baichuan-inc/Baichuan-7B.
  10. Language models are few-shot learners. In Advances in Neural Information Processing Systems.
  11. Evaluating large language models trained on code.
  12. Training verifiers to solve math word problems.
  13. Together Computer. 2023. RedPajama: An open dataset for training large language models. https://github.com/togethercomputer/RedPajama-Data.
  14. Efficient and effective text encoding for Chinese LLaMA and Alpaca.
  15. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.
  16. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  17. Measuring mathematical problem solving with the MATH dataset. In Conference on Neural Information Processing Systems Track on Datasets and Benchmarks.
  18. C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models.
  19. InternLM. 2023. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
  20. Challenges and applications of large language models.
  21. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  22. Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.
  23. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers.
  24. CMMLU: Measuring massive multitask language understanding in Chinese.
  25. Let’s verify step by step.
  26. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
  27. MosaicML et al. 2023. MPT-30B: Raising the bar for open-source foundation models. https://www.mosaicml.com/blog/mpt-30b.
  28. OpenCompass. 2023. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
  29. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Workshop on the Challenges in the Management of Large Corpora.
  30. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  31. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.
  32. YaRN: Efficient context window extension of large language models.
  33. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations.
  34. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  35. Proximal policy optimization algorithms.
  36. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need.
  37. Noam Shazeer. 2020. GLU variants improve transformer.
  38. SlimPajama-DC: Understanding data combinations for LLM training.
  39. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.
  40. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, page 127063.
  41. Challenging big-bench tasks and whether chain-of-thought can solve them.
  42. LLaMA: Open and efficient foundation language models.
  43. LLaMA 2: Open foundation and fine-tuned chat models.
  44. Attention is all you need. In Advances in Neural Information Processing Systems.
  45. BLOOM: A 176B-parameter open-access multilingual language model.
  46. XVERSE. 2023. XVERSE-13B: A multilingual large language model developed by XVERSE Technology Inc. https://github.com/xverse-ai/XVERSE-13B.
  47. Baichuan 2: Open large-scale language models.
  48. GLM-130B: An open bilingual pre-trained model. In International Conference on Learning Representations.
  49. Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In Advances in Neural Information Processing Systems.
  50. Evaluating the performance of large language models on GAOKAO benchmark.
  51. AGIEval: A human-centric benchmark for evaluating foundation models.
Authors (53)
  1. Yin Luo
  2. Qingchao Kong
  3. Nan Xu
  4. Jia Cao
  5. Bao Hao
  6. Baoyu Qu
  7. Bo Chen
  8. Chao Zhu
  9. Chenyang Zhao
  10. Donglei Zhang
  11. Fan Feng
  12. Feifei Zhao
  13. Hailong Sun
  14. Hanxuan Yang
  15. Haojun Pan
  16. Hongyu Liu
  17. Jianbin Guo
  18. Jiangtao Du
  19. Jingyi Wang
  20. Junfeng Li