Baichuan 2: Open Large-scale Language Models (2309.10305v2)

Published 19 Sep 2023 in cs.CL

Abstract: LLMs have demonstrated remarkable performance on a variety of natural language tasks based on just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual LLMs containing 7 billion and 13 billion parameters, trained from scratch, on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to benefit the research community in better understanding the training dynamics of Baichuan 2.

Overview of "Baichuan 2: Open Large-scale Language Models"

The paper introduces Baichuan 2, a family of open-source, multilingual LLMs with 7 billion and 13 billion parameters. Both models were trained from scratch on 2.6 trillion tokens, substantially more data than comparable open-source models of the time. Baichuan 2 is designed to match or exceed similar open-source models across a range of benchmarks, with a particular focus on non-English languages, especially Chinese.

Key Contributions

Baichuan 2 significantly enriches the landscape of open-source LLMs in several critical ways:

  • Extensive Training Dataset: The models were trained on 2.6 trillion tokens drawn from a large multilingual corpus, far more data than was used for the first-generation Baichuan 1 models.
  • Benchmark Performance: On benchmarks including MMLU, CMMLU, GSM8K, and HumanEval, Baichuan 2 matches or outperforms open-source models of similar size, with particularly large gains on mathematics and code-related tasks.
  • Domain Specialization: Baichuan 2 shows strong results in specialized domains such as medicine and law, making it a solid foundation for further domain-specific fine-tuning.
  • Open Model Release: All pre-training checkpoints are made available, facilitating deeper insights into the training dynamics of the model, which can be an invaluable resource for research and development.

Technical Modifications

Baichuan 2 incorporates several architectural enhancements and training optimizations:

  • Tokenizer and Model Architecture: The vocabulary was expanded to 125,696 tokens using byte-pair encoding (BPE), giving a better balance between compression rate and computational efficiency than earlier Baichuan releases.
  • Positional Embeddings and Optimizations: The models use Rotary Position Embedding (RoPE) for the 7B variant and ALiBi for the 13B variant (an illustrative ALiBi sketch appears after this list). They also integrate memory-efficient attention and the SwiGLU activation function to improve training efficiency and performance.
  • NormHead and Max-z Loss: NormHead normalizes the output embeddings to stabilize training, while the max-z loss penalizes excessively large logits, making inference more robust (a minimal sketch of both follows the ALiBi example).
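
The ALiBi choice can be illustrated with a short, self-contained sketch. This is not Baichuan 2's released code; the head count, tensor shapes, and the power-of-two slope schedule are assumptions for illustration only.

```python
# Illustrative ALiBi attention bias (assumed shapes; not Baichuan 2's implementation).
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric per-head slopes as in the ALiBi paper, assuming n_heads is a power of two.
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * (i - j): zero at the current token,
    # increasingly negative for keys further in the past.
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]          # (T, T), entry [i, j] = j - i
    slopes = alibi_slopes(n_heads)                  # (H,)
    return slopes[:, None, None] * distance[None]   # (H, T, T)

# Usage: add the bias to the pre-softmax attention scores, on top of the
# usual causal mask that blocks attention to future positions, e.g.
# scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(H, T)
```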

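NormHead and the max-z loss can likewise be sketched compactly. The snippet below is a minimal PyTorch illustration under assumed shapes; the max-z coefficient is a placeholder hyperparameter, not a value taken from the paper's training configuration.

```python
# Minimal sketch of NormHead and a max-z auxiliary loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    """Output head whose embedding rows are L2-normalized before the logit matmul."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(vocab_size, hidden_size))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Normalizing each output embedding keeps logit scales comparable
        # across tokens, which is the stabilization effect described above.
        return hidden @ F.normalize(self.weight, dim=-1).t()

def lm_loss_with_max_z(logits: torch.Tensor, labels: torch.Tensor,
                       z_coeff: float = 1e-4) -> torch.Tensor:
    # Cross-entropy plus a penalty on the largest logit per position,
    # discouraging unbounded logit growth during training.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    z_max = logits.max(dim=-1).values
    return ce + z_coeff * (z_max ** 2).mean()
```
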
Safety and Alignment

Baichuan 2 undergoes a detailed alignment procedure that produces chat models tuned for dialogue understanding and instruction following. The alignment process includes:

  • Supervised and Reinforcement Learning: Human-annotated data was used for supervised fine-tuning, and reinforcement learning from human feedback then further refined the responses, with a reward model used to score and optimize generations (a minimal reward-model sketch follows this list).
  • Safe Model Development: Safety measures were applied throughout the pipeline, from pre-training data filtering through the reinforcement learning stage, to reduce harmful outputs. The reported safety evaluations show clear improvements without compromising helpfulness.
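
The reward-model component can be illustrated with a short sketch. The scalar value head and pairwise ranking loss below are standard RLHF building blocks used here as assumed stand-ins; the paper does not release its reward-model code.

```python
# Illustrative reward-model head and pairwise ranking loss (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Scalar head placed on top of a pretrained LM trunk (trunk omitted here)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Score a whole response using the hidden state of its final token.
        return self.value(last_hidden_state[:, -1, :]).squeeze(-1)

def ranking_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: the human-preferred response should receive
    # a higher scalar reward than the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()
```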

Implications and Future Directions

This work contributes to the ongoing trend toward open and transparent AI development, emphasizing multilingual capabilities, data efficiency, and domain specialization. The release of intermediary checkpoints is particularly valuable for ongoing research in understanding and improving training dynamics. Future developments could enhance Baichuan 2's safety mechanisms, expand its multilingual scope, and refine its performance further in specialized domains.

Overall, Baichuan 2 represents a substantial progression for open-source LLMs, broadening the accessibility and applicability of AI technologies beyond predominantly English-centric models.
