Yi: Open Foundation Models by 01.AI (2403.04652v3)

Published 7 Mar 2024 in cs.CL and cs.AI

Abstract: We introduce the Yi model family, a series of language and multimodal models that demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and 34B pretrained LLMs, then we extend them to chat models, 200K long context models, depth-upscaled models, and vision-LLMs. Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our finetuned chat models deliver strong human preference rate on major evaluation platforms like AlpacaEval and Chatbot Arena. Building upon our scalable super-computing infrastructure and the classical transformer architecture, we attribute the performance of Yi models primarily to its data quality resulting from our data-engineering efforts. For pretraining, we construct 3.1 trillion tokens of English and Chinese corpora using a cascaded data deduplication and quality filtering pipeline. For finetuning, we polish a small scale (less than 10K) instruction dataset over multiple iterations such that every single instance has been verified directly by our machine learning engineers. For vision-language, we combine the chat LLM with a vision transformer encoder and train the model to align visual representations to the semantic space of the LLM. We further extend the context length to 200K through lightweight continual pretraining and demonstrate strong needle-in-a-haystack retrieval performance. We show that extending the depth of the pretrained checkpoint through continual pretraining further improves performance. We believe that given our current results, continuing to scale up model parameters using thoroughly optimized data will lead to even stronger frontier models.

Summary

  • The paper details how pretraining on 3.1 trillion English and Chinese tokens, cleaned through a cascaded deduplication and quality-filtering pipeline, drives strong performance across bilingual and multimodal tasks.
  • The paper outlines architectural innovations like Grouped-Query Attention and SwiGLU activation that balance computational efficiency with robust model capabilities.
  • The paper demonstrates a cost-effective, detail-oriented fine-tuning approach for chat applications while extending capabilities to support longer contexts and vision-language integration.

Insights and Developments in the Yi Model Series by 01.AI

Introduction to the Yi Model Series

The Yi model series, developed by 01.AI, marks a significant step forward in the field of LLMs. Comprising both 6B and 34B parameter models, the Yi family showcases its prowess across a multitude of tasks ranging from multi-modal challenges to chat-based applications. Built on a foundation of high-quality data engineering, the Yi models boast strong performance on benchmarks like MMLU, along with commendable human preference rates on evaluation platforms like AlpacaEval and Chatbot Arena. This summary explores the key aspects of the Yi model's development, including its data engineering strategies, model architecture, and the implications of its research findings.

Pretraining and Data Engineering

One of the standout features of the Yi models is the meticulous data engineering that underpins their training. A corpus of 3.1 trillion English and Chinese tokens is assembled through a cascaded pipeline of deduplication and quality filtering, combining heuristic rules with learned filters and paying particular attention to the challenges of Chinese web content. The authors attribute the models' performance primarily to this data quality, arguing that carefully refined pretraining and fine-tuning data matters more than raw volume.
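
To make the cascade concrete, the sketch below shows a minimal, illustrative version of such a pipeline: exact deduplication first, then cheap heuristic filters, then a learned quality score. The helper names, thresholds, and the stand-in scorer are assumptions made here for illustration, not 01.AI's actual implementation.

import hashlib
from typing import Callable, Iterable, Iterator

def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    # Drop byte-identical documents using an MD5 fingerprint of the text.
    seen = set()
    for doc in docs:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

def heuristic_filter(docs: Iterable[str]) -> Iterator[str]:
    # Cheap rule-based checks: minimum length and a repetition bound.
    for doc in docs:
        words = doc.split()
        if len(words) < 50:
            continue  # too short to be useful training text
        if len(set(words)) / len(words) < 0.3:
            continue  # highly repetitive, likely boilerplate
        yield doc

def alpha_ratio_score(doc: str) -> float:
    # Stand-in for a learned quality classifier (e.g. a fastText or BERT scorer):
    # here simply the fraction of alphabetic or whitespace characters.
    return sum(c.isalpha() or c.isspace() for c in doc) / max(len(doc), 1)

def learned_filter(docs: Iterable[str],
                   score: Callable[[str], float] = alpha_ratio_score,
                   threshold: float = 0.7) -> Iterator[str]:
    # Final, most expensive stage: keep documents the scorer rates highly.
    for doc in docs:
        if score(doc) >= threshold:
            yield doc

def pipeline(raw_docs: Iterable[str]) -> Iterator[str]:
    # Cascade the stages from cheapest to most expensive.
    return learned_filter(heuristic_filter(exact_dedup(raw_docs)))

Ordering the stages from cheapest to most expensive keeps the learned filter's workload manageable at trillion-token scale.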

Architectural Choices

The Yi models follow the standard decoder-only transformer architecture, with tailored modifications such as Grouped-Query Attention and SwiGLU activation that improve computational efficiency without sacrificing capability. The 6B and 34B parameter counts are chosen deliberately, balancing capability against inference and serving cost so that the models remain broadly accessible to deploy.
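
As a point of reference, here is a minimal PyTorch sketch of the two modifications named above: a SwiGLU feed-forward block, and grouped-query attention in which several query heads share a single key/value head. The dimensions and head counts are illustrative assumptions, not Yi's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F  # requires PyTorch 2.x for scaled_dot_product_attention

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) gates (x W_up) elementwise, then project down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each key/value head is shared by n_heads // n_kv_heads query heads.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

Because keys and values are stored once per KV head rather than once per query head, the inference-time KV cache shrinks by the grouping factor, which is what makes GQA attractive for serving.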

Fine-tuning for Chat Models

When it comes to fine-tuning for chat models, the Yi series deviates from large-scale instruction tuning and opts instead for a detail-oriented approach: the instruction dataset contains fewer than 10K examples, each crafted and iteratively polished, with every instance verified directly by the team's machine learning engineers. Emphasizing data quality over sheer volume keeps the models aligned with nuanced user preferences. Separately, model quantization (for example to 4-bit weights) supports cost-efficient deployment of the chat models on consumer-grade hardware.
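
The sketch below illustrates the consumer-hardware deployment path using 4-bit weight quantization through Hugging Face transformers and bitsandbytes. The repository name, the memory target, and the choice of tooling are assumptions made here for illustration; the paper does not prescribe this exact stack.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-34B-Chat"  # assumed Hugging Face repository name

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights to fit a single 24 GB GPU (assumed target)
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Explain grouped-query attention in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))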

Extending Capabilities

Beyond its foundational capabilities, the Yi model series is extended in three significant directions: lightweight continual pretraining that stretches the context window to 200K tokens, integration of vision-language tasks by pairing the chat model with a vision transformer encoder, and depth up-scaling of the pretrained checkpoint. These extensions unlock new dimensions of performance, from strong needle-in-a-haystack retrieval over long contexts to broader multimodal understanding and further gains from continually pretraining the deepened model.
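
Of the three directions, depth up-scaling is the easiest to illustrate in code: a deeper model is initialized by duplicating a contiguous block of layers from the pretrained checkpoint and is then continually pretrained. The sketch below shows only the layer-stack surgery; the toy 32-block stack and the choice of which layers to duplicate are assumptions for illustration, not the exact recipe used for Yi.

import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    # Return a new stack in which layers[start:end] appear twice in a row;
    # the copies are deep-copied so they can diverge during continual pretraining.
    original = list(layers)
    duplicated = [copy.deepcopy(layer) for layer in original[start:end]]
    return nn.ModuleList(original[:end] + duplicated + original[end:])

# Toy usage: a 32-block stand-in decoder becomes a 48-block one by repeating
# blocks 8..23; continual pretraining would then recover and extend quality.
toy_blocks = nn.ModuleList(nn.Linear(16, 16) for _ in range(32))
deeper = depth_upscale(toy_blocks, start=8, end=24)
print(len(toy_blocks), "->", len(deeper))  # 32 -> 48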

Infrastructure and Safety Measures

Underpinning the development and deployment of the Yi models is a robust infrastructure that supports comprehensive scheduling, efficient training, and adaptive serving. Coupled with this technical backbone is a proactive approach to model safety, ensuring responsible use and alignment with ethical considerations through every stage of the model's lifecycle.

Evaluation and Community Impact

Extensive evaluation underscores the Yi models' competitive edge: the chat models approach the performance of notable counterparts like GPT-3.5 while remaining openly available and locally deployable, which keeps user data under the user's control. The results support the authors' view that scaling up model parameters on thoroughly optimized data can continue to push the boundaries of what LLMs achieve.

Conclusion

The development and refinement of the Yi model series represent a confluence of rigorous data engineering, architectural innovation, and strategic capability extensions. Through detailed pretraining data processing, focused fine-tuning methodologies, and expansive infrastructure support, 01.AI positions the Yi models as powerful tools for research and application in the AI community. As we look to the future, the Yi series not only sets a new standard for LLM performance but also emphasizes the importance of ethical considerations and user-centric design in advancing artificial intelligence.

HackerNews

  1. Yi: Open Foundation Models by 01.AI (205 points, 81 comments)
  2. Yi: Open Foundation Models (1 point, 0 comments)