Yi: Open Foundation Models by 01.AI (2403.04652v3)

Published 7 Mar 2024 in cs.CL and cs.AI

Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained LLMs, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-LLMs. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat LLM with a vision transformer encoder and train the model to align visual representations to the semantic space of the LLM. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.

Summary

  • The paper details how pretraining on 3.1 trillion English and Chinese tokens, cleaned through a cascaded deduplication and quality-filtering pipeline, drives strong performance across bilingual and multimodal tasks.
  • The paper outlines architectural innovations like Grouped-Query Attention and SwiGLU activation that balance computational efficiency with robust model capabilities.
  • The paper demonstrates a cost-effective, detail-oriented fine-tuning approach for chat applications while extending capabilities to support longer contexts and vision-language integration.

Insights and Developments in the Yi Model Series by 01.AI

Introduction to the Yi Model Series

The Yi model series, developed by 01.AI, marks a significant step forward in the field of LLMs. Comprising both 6B and 34B parameter models, the Yi family showcases its prowess across a multitude of tasks ranging from multi-modal challenges to chat-based applications. Built on a foundation of high-quality data engineering, the Yi models boast strong performance on benchmarks like MMLU, along with commendable human preference rates on evaluation platforms like AlpacaEval and Chatbot Arena. This summary explores the key aspects of the Yi model's development, including its data engineering strategies, model architecture, and the implications of its research findings.

Pretraining and Data Engineering

One of the standout features of the Yi models is the meticulous data engineering that underpins their training. A corpus of 3.1 trillion English and Chinese tokens is assembled through a cascaded pipeline of deduplication and quality filtering, combining heuristic rules with learned filters and paying particular attention to the challenges of Chinese web content. The authors attribute the models' performance primarily to this data quality, arguing that carefully refined pretraining and fine-tuning data matters more than raw volume.
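
To make the cascade concrete, the sketch below shows a minimal, illustrative version of such a pipeline: exact deduplication first, then cheap heuristic filters, then a learned quality score. The helper names, thresholds, and the stand-in scorer are assumptions made here for illustration, not 01.AI's actual implementation.

import hashlib
from typing import Callable, Iterable, Iterator

def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    # Drop byte-identical documents using an MD5 fingerprint of the text.
    seen = set()
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def heuristic_filter(docs: Iterable[str]) -> Iterator[str]:
    # Cheap rule-based checks: minimum length and a repetition bound.
    for doc in docs:
        words = doc.split()
        if len(words) < 50:
            continue  # too short to be useful training text
        if len(set(words)) / len(words) < 0.3:
            continue  # highly repetitive, likely boilerplate
        yield doc

def alpha_ratio_score(doc: str) -> float:
    # Stand-in for a learned quality classifier (e.g. a fastText or BERT scorer):
    # here simply the fraction of alphabetic or whitespace characters.
    return sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)

def learned_filter(docs: Iterable[str],
                   score: Callable[[str], float] = alpha_ratio_score,
                   threshold: float = 0.7) -> Iterator[str]:
    # Final, most expensive stage: keep documents the scorer rates highly.
    for doc in docs:
        if score(doc) >= threshold:
            yield doc

def pipeline(raw_docs: Iterable[str]) -> Iterator[str]:
    # Cascade the stages from cheapest to most expensive.
    return learned_filter(heuristic_filter(exact_dedup(raw_docs)))

Ordering the stages from cheapest to most expensive keeps the learned filter's workload manageable at trillion-token scale.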

Architectural Choices

The Yi models follow the standard decoder-only transformer architecture, with tailored modifications such as Grouped-Query Attention and SwiGLU activation that improve computational efficiency without sacrificing capability. The 6B and 34B parameter counts are chosen deliberately, balancing capability against inference and serving cost so that the models remain broadly accessible to deploy.
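
As a point of reference, here is a minimal PyTorch sketch of the two modifications named above: a SwiGLU feed-forward block, and grouped-query attention in which several query heads share a single key/value head. The dimensions and head counts are illustrative assumptions, not Yi's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F  # requires PyTorch 2.x for scaled_dot_product_attention

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) gates (x W_up) elementwise, then project down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each key/value head is shared by n_heads // n_kv_heads query heads.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

Because keys and values are stored once per KV head rather than once per query head, the inference-time KV cache shrinks by the grouping factor, which is what makes GQA attractive for serving.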

Fine-tuning for Chat Models

When it comes to fine-tuning for chat models, the Yi series deviates from large-scale instruction tuning and opts instead for a detail-oriented approach: the instruction dataset contains fewer than 10K examples, each crafted and iteratively polished, with every instance verified directly by the team's machine learning engineers. Emphasizing data quality over sheer volume keeps the models aligned with nuanced user preferences. Separately, model quantization (for example to 4-bit weights) supports cost-efficient deployment of the chat models on consumer-grade hardware.
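
The sketch below illustrates the consumer-hardware deployment path using 4-bit weight quantization through Hugging Face transformers and bitsandbytes. The repository name, the memory target, and the choice of tooling are assumptions made here for illustration; the paper does not prescribe this exact stack.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-34B-Chat"  # assumed Hugging Face repository name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit a single 24 GB GPU (assumed target)
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Explain grouped-query attention in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))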

Extending Capabilities

Beyond its foundational capabilities, the Yi model series is extended in three significant directions: lightweight continual pretraining that stretches the context window to 200K tokens, integration of vision-language tasks by pairing the chat model with a vision transformer encoder, and depth up-scaling of the pretrained checkpoint. These extensions unlock new dimensions of performance, from strong needle-in-a-haystack retrieval over long contexts to broader multimodal understanding and further gains from continually pretraining the deepened model.
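
Of the three directions, depth up-scaling is the easiest to illustrate in code: a deeper model is initialized by duplicating a contiguous block of layers from the pretrained checkpoint and is then continually pretrained. The sketch below shows only the layer-stack surgery; the toy 32-block stack and the choice of which layers to duplicate are assumptions for illustration, not the exact recipe used for Yi.

import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    # Return a new stack in which layers[start:end] appear twice in a row;
    # the copies are deep-copied so they can diverge during continual pretraining.
    original = list(layers)
    duplicated = [copy.deepcopy(layer) for layer in original[start:end]]
    return nn.ModuleList(original[:end] + duplicated + original[end:])

# Toy usage: a 32-block stand-in decoder becomes a 48-block one by repeating
# blocks 8..23; continual pretraining would then recover and extend quality.
toy_blocks = nn.ModuleList(nn.Linear(16, 16) for _ in range(32))
deeper = depth_upscale(toy_blocks, start=8, end=24)
print(len(toy_blocks), "->", len(deeper))  # 32 -> 48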

Infrastructure and Safety Measures

Underpinning the development and deployment of the Yi models is a robust infrastructure that supports comprehensive scheduling, efficient training, and adaptive serving. Coupled with this technical backbone is a proactive approach to model safety, ensuring responsible use and alignment with ethical considerations through every stage of the model's lifecycle.

Evaluation and Community Impact

Extensive evaluation underscores the Yi models' competitive edge: the chat models approach the performance of notable counterparts like GPT-3.5 while remaining openly available and locally deployable, which keeps user data under the user's control. The results support the authors' view that scaling up model parameters on thoroughly optimized data can continue to push the boundaries of what LLMs achieve.

Conclusion

The development and refinement of the Yi model series represent a confluence of rigorous data engineering, architectural innovation, and strategic capability extensions. Through detailed pretraining data processing, focused fine-tuning methodologies, and expansive infrastructure support, 01.AI positions the Yi models as powerful tools for research and application in the AI community. As we look to the future, the Yi series not only sets a new standard for LLM performance but also emphasizes the importance of ethical considerations and user-centric design in advancing artificial intelligence.

HackerNews

  1. Yi: Open Foundation Models by 01.AI (205 points, 81 comments)
  2. Yi: Open Foundation Models (1 point, 0 comments)