MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series (2405.19327v4)

Published 29 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have made great strides in recent years, achieving unprecedented performance across different tasks. However, due to commercial interests, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosure of training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the models' weights are provided, with most details (e.g., intermediate checkpoints, pre-training corpus, and training code) undisclosed. To improve the transparency of LLMs, the research community has moved to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), for which more details (e.g., pre-training corpus and training code) are provided. These models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs are still inferior to state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual LLM with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will strengthen the open research community and inspire more innovation and creativity to facilitate further improvements of LLMs.

An Academic Analysis of MAP-Neo: A Fully Transparent Bilingual LLM

The paper under review introduces MAP-Neo, a novel bilingual LLM consisting of 7 billion parameters, which is fully open-sourced and transparent. The authors present a comprehensive overview of the entire pipeline used in developing MAP-Neo, ranging from data curation and training processes to the disclosure of model checkpoints and training frameworks. This essay provides an expert appraisal of the research findings and their implications for the field of NLP.

Overview of MAP-Neo

MAP-Neo stands out in the current landscape of LLMs due to its emphasis on transparency and full open-sourcing. The model addresses several critical gaps in the open-source community, particularly the need for high-performance models that are on par with proprietary solutions. Notably, the paper reports that MAP-Neo achieves competitive performance through a transparent development process that includes access to the pre-training corpus (Matrix Data Pile), detailed data curation pipelines, checkpoints, and an optimized training and evaluation framework.

Transparency and Open Source Commitment

One of the significant contributions of MAP-Neo, as discussed in the paper, is its unmatched level of transparency. Unlike open-weight models such as LLaMA-3 and BLOOM, which often lack comprehensive details about their pre-training data and intermediate checkpoints, MAP-Neo discloses the cleaned pre-training corpus, the data cleaning pipeline, training code, intermediate checkpoints, and the evaluation framework, making it a highly reproducible model.

Data Curation and Pre-Processing

The authors introduce the Matrix Data Pile, a large-scale pre-training corpus comprising 4.5 trillion tokens. The data curation process combines sophisticated data filtering, deduplication methods, and a robust document conversion pipeline. Given the critical role of high-quality data in LLM development, the paper's comprehensive data processing and cleaning methodologies ensure the reliability and effectiveness of the model. The authors also provide a detailed breakdown of the corpus composition, underscoring the rigorous multi-stage data cleaning and quality assurance techniques employed; a simplified sketch of one common deduplication step appears below.
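
The paper's pipeline is not reduced to a single algorithm in this summary, but near-duplicate removal via MinHash is a standard building block of web-scale corpus cleaning. The following is a minimal, self-contained sketch of that idea; the shingle size, number of hash permutations, and example documents are illustrative assumptions, not the authors' released implementation.

```python
# Minimal MinHash near-duplicate sketch (illustrative only; not the
# MAP-Neo/Matrix pipeline). Shingle size and permutation count are
# arbitrary choices for demonstration.
import hashlib
from itertools import combinations


def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(shingle_set: set, num_perm: int = 64) -> list:
    """MinHash signature using seeded MD5 hashes as a family of hash functions."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingle_set)
        for seed in range(num_perm)
    ]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bend",
    "large language models benefit greatly from carefully deduplicated training data",
]
signatures = [minhash_signature(shingles(d)) for d in docs]
for (i, si), (j, sj) in combinations(enumerate(signatures), 2):
    print(f"docs {i} vs {j}: estimated Jaccard = {estimated_jaccard(si, sj):.2f}")
```

In production pipelines, the pairwise comparison above is replaced by locality-sensitive hashing (banding the signatures into buckets) so that only candidate pairs sharing a bucket are compared.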

Numerical Results and Model Performance

MAP-Neo demonstrates strong performance across multiple benchmarks, particularly in code generation, mathematical reasoning, and multilingual understanding. Key numerical results highlighted in the paper include a HumanEval score of 23.8 and a GSM8K score of 53.68, which place MAP-Neo close to industry-level models such as LLaMA3-8B and Mistral-7B. This robust performance is attributed to the model's high-quality pre-training data and optimized training framework.
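
For context on how code-generation scores such as the HumanEval number are typically computed, the sketch below implements the standard unbiased pass@k estimator introduced with the HumanEval benchmark (Chen et al., 2021); the per-problem sample counts are hypothetical and are not taken from the MAP-Neo paper.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021).
# The per-problem correct counts below are hypothetical, not MAP-Neo results.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples is correct,
    given c correct completions out of n generated for a problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 20 completions per problem.
correct_per_problem = [0, 2, 7, 0, 20, 1]
n_samples = 20
pass_at_1 = sum(pass_at_k(n_samples, c, k=1) for c in correct_per_problem) / len(correct_per_problem)
print(f"pass@1 = {pass_at_1:.3f}")
```

A reported benchmark score such as HumanEval 23.8 is this quantity averaged over the benchmark's 164 problems, expressed as a percentage.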

Implications and Future Directions

The introduction of MAP-Neo has several implications for both practical applications and future research. From a practical standpoint, the full transparency offered by MAP-Neo lowers the barrier for organizations and researchers to understand and leverage advanced LLM technologies without being constrained by proprietary limitations. The detailed disclosure of the model's training process and data curation paves the way for enhanced reproducibility and independent validation in the research community.

Theoretically, MAP-Neo sets a new standard for developing high-performance, transparent LLMs. This transparency can drive further innovations in NLP by enabling unbiased analyses of model behavior, identification of biases, and understanding of potential risks. The comprehensive release of the pre-training corpus and frameworks can also inspire new methodologies and optimizations in the field.

Conclusion

MAP-Neo represents a significant advancement in the development of open-source, transparent LLMs. Its bilingual capabilities, combined with fully disclosed training and evaluation pipelines, provide a valuable asset for the research community. The model not only demonstrates strong performance across various tasks but also highlights the importance of transparency and reproducibility in advancing NLP research. As the field continues to evolve, models like MAP-Neo will play a crucial role in democratizing access to LLM technologies and driving forward innovative research in artificial intelligence.
