
Breeze-7B Technical Report

Published 5 Mar 2024 in cs.CL | arXiv:2403.02712v2

Abstract: Breeze-7B is an open-source LLM based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages for the Breeze-7B model. The Breeze-7B family of base and chat models exhibits good performance on language comprehension and chatbot-oriented tasks, reaching the top in several benchmarks among models comparable in its complexity class.

References (17)
  1. The Falcon Series of Open Language Models, 2023.
  2. epfLLM Megatron-LLM, 2023. URL https://github.com/epfLLM/Megatron-LLM.
  3. Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results, 2023.
  4. Google Gemini Team. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context, 2024.
  5. Textbooks Are All You Need, June 2023. URL https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need/.
  6. Advancing the Evaluation of Traditional Chinese Language Models: Towards a Comprehensive Benchmark Suite, 2023.
  7. Mistral 7B, 2023.
  8. Textbooks Are All You Need II: phi-1.5 Technical Report, September 2023. URL https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need-ii-phi-1-5-technical-report/.
  9. ChipNeMo: Domain-Adapted LLMs for Chip Design. arXiv preprint arXiv:2311.00176, 2023.
  10. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983, 2016.
  11. OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
  12. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100, 2022.
  13. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
  14. DRCD: A Chinese Machine Reading Comprehension Dataset. arXiv, abs/1806.00920, 2018. URL https://api.semanticscholar.org/CorpusID:46932369.
  15. An Improved Traditional Chinese Evaluation Suite for Foundation Model. arXiv, 2023.
  16. Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.
  17. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.

Summary

  • The paper presents Breeze-7B, a model designed to improve Traditional Chinese language processing and interactive chatbot functions.
  • It employs rigorous data preprocessing, an extended tokenizer, and a training setup built on the Megatron-LLM library, with vocabulary-extension practice informed by BLOOM, for enhanced training efficiency.
  • Benchmarking results show superior language comprehension, long-context handling up to 32k tokens, and competitive performance in diverse chatbot scenarios.

Breeze-7B: Technical Advancements and Performance Evaluation

The technical report on Breeze-7B provides an in-depth account of the model's development, focusing on improved language comprehension and chatbot capabilities in Traditional Chinese. This essay outlines Breeze-7B's methodology, architecture, training, and benchmarking results, positioning it as a strong open-source model tailored to Traditional Chinese language tasks.

Methodological Framework

Data Collection and Preprocessing

The creation of Breeze-7B builds on Mistral-7B's foundation, with a significant emphasis on enhancing capabilities in Traditional Chinese. The model was further pretrained on a meticulously curated dataset comprising 650 GB of high-quality Traditional Chinese text, sourced through extensive web crawling and existing datasets. This deliberate selection aimed to overcome the limitations Mistral-7B shows on Traditional Chinese content, particularly regarding factual knowledge and reasoning.

A robust preprocessing framework was employed to maximize data quality, inspired by methodologies from BLOOM and CCNet. This involved filtering strategies for removing unwanted noise and simplifying Chinese content, ensuring the dataset's relevance and richness.
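As a rough illustration of this kind of cleaning, the sketch below combines two hypothetical filters: exact-duplicate removal via hashing and a Simplified-character ratio check to drop Simplified Chinese content. The character list and threshold are illustrative assumptions, not the authors' actual pipeline.

```python
import hashlib

# A few characters that occur only in Simplified Chinese orthography
# (illustrative, far from exhaustive).
SIMPLIFIED_ONLY = set("国对这学说为么")

def is_mostly_traditional(text: str, max_simplified_ratio: float = 0.01) -> bool:
    """Reject documents dominated by Simplified-only characters."""
    han = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    if not han:
        return False  # no Han characters at all -> not useful here
    simplified = sum(1 for c in han if c in SIMPLIFIED_ONLY)
    return simplified / len(han) <= max_simplified_ratio

def dedup_key(text: str) -> str:
    """Hash of whitespace-normalized text, for exact-duplicate removal."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that pass the language filter, dropping duplicates."""
    seen = set()
    for doc in docs:
        if not is_mostly_traditional(doc):
            continue  # drop Simplified-dominated or non-Han noise
        key = dedup_key(doc)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        yield doc
```

Real pipelines such as CCNet additionally use fuzzy deduplication and model-based quality scoring; this sketch only captures the overall shape of a filter chain.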

Model Architecture Customization

Breeze-7B features several architectural modifications to optimize its performance for Traditional Chinese. An extended tokenizer was devised to improve the compression rate and speed up training, effectively extending the usable context for Traditional Chinese text to roughly 11.1k within the original token budget. The architecture was adapted to accommodate this expanded vocabulary, allowing the model to process extended sequences efficiently.
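The effective-context gain follows from simple arithmetic: if the new tokenizer packs more Traditional Chinese characters into each token, the same token window covers more text. The characters-per-token values below are illustrative assumptions, not measured figures from the report.

```python
def effective_context(base_context_tokens: int,
                      chars_per_token_old: float,
                      chars_per_token_new: float) -> float:
    """Effective context of the new tokenizer, expressed in units of
    the old tokenizer's tokens, for the same character coverage."""
    chars_covered = base_context_tokens * chars_per_token_new
    return chars_covered / chars_per_token_old

# Assumed values: if the original tokenizer averaged ~0.72 Traditional
# Chinese characters per token and the extended one averages ~1.0, an
# 8k window covers as much Chinese text as ~11.1k old-tokenizer tokens:
print(round(effective_context(8000, 0.72, 1.0)))  # 11111
```

The same ratio also translates directly into inference speed, since fewer tokens are generated for the same amount of Chinese text.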

Training and Evaluation

Breeze-7B's training employed a flat learning rate over the large pretraining corpus and used the Megatron-LLM library for both tensor- and data-parallel training, making efficient use of compute. A strong focus was placed on validating data quality, and validation perplexity was tracked to monitor in-training progress (Figure 1).

Figure 1: Perplexity (PPL) change during the additional pretraining stage of Breeze-7B, after the vocabulary size extension. The PPL scores are calculated using our proprietary Traditional Chinese validation dataset.
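For reference, the perplexity tracked in Figure 1 is just the exponentiated mean negative log-likelihood per token over the validation set. The sketch below uses made-up per-token losses, not real model outputs.

```python
import math

def perplexity(token_nlls):
    """PPL = exp(mean negative log-likelihood over validation tokens)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token negative log-likelihoods from an eval run:
nlls = [2.1, 1.8, 2.4, 1.9, 2.0]
print(f"validation PPL = {perplexity(nlls):.2f}")
```

A falling PPL curve on held-out Traditional Chinese text is what indicates the additional pretraining is absorbing the new language data rather than merely memorizing the corpus.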

Instruction Finetuning

To enhance chatbot capabilities, Breeze-7B underwent instruction finetuning, utilizing English datasets to bolster its proficiency in varied conversational and task-oriented applications. This phase involved filtering and transforming data to ensure contextual appropriateness and efficacy in an instructional setting.
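A minimal sketch of the filtering-and-transformation step is shown below: short or empty responses are dropped, and surviving pairs are rendered with a chat template. The template string and length threshold are hypothetical, for illustration only, not the authors' actual recipe.

```python
def passes_filters(example: dict, min_len: int = 10) -> bool:
    """Drop examples whose response is empty or trivially short."""
    return len(example.get("response", "").strip()) >= min_len

def to_chat_format(example: dict) -> str:
    """Render one instruction/response pair with a generic template."""
    return (f"<s>USER: {example['instruction'].strip()}\n"
            f"ASSISTANT: {example['response'].strip()}</s>")

raw = [
    {"instruction": "Summarize the report.", "response": "ok"},  # too short
    {"instruction": "Translate 'hello' to Traditional Chinese.",
     "response": "「hello」的繁體中文是「哈囉」。"},
]
train = [to_chat_format(ex) for ex in raw if passes_filters(ex)]
print(len(train))  # 1
```

In practice the special tokens must match the model's actual chat template; using a mismatched template at inference time noticeably degrades chat quality.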

Benchmarking and Results

Breeze-7B was subjected to rigorous benchmarking to evaluate its language comprehension and chatbot capabilities across various scripted and unscripted environments.

Language Comprehension and Chatbot Benchmarks

The model was rigorously assessed against prominent benchmarks like TMMLU+ and DRCD for language comprehension, demonstrating strong performance across diverse disciplines and comprehending nuances in Traditional Chinese text effectively.
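TMMLU+-style items are multiple-choice, and a common way to score a base model is to pick the option to which the model assigns the highest likelihood. The log-probability values below are stand-ins, not real model outputs.

```python
def pick_answer(option_scores: dict) -> str:
    """Choose the option (e.g. A-D) with the highest log-likelihood."""
    return max(option_scores, key=option_scores.get)

def accuracy(predictions, gold):
    """Fraction of questions answered correctly."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical per-option log-probabilities for two questions:
preds = [pick_answer({"A": -3.2, "B": -1.1, "C": -4.0, "D": -2.7}),
         pick_answer({"A": -0.9, "B": -2.5, "C": -1.3, "D": -3.8})]
print(accuracy(preds, ["B", "A"]))  # 1.0
```

Chat models are often scored instead by generating an answer letter and string-matching it; the two protocols can give slightly different numbers for the same model.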

In chatbot-oriented benchmarks such as MT-Bench-tw, the model exhibited competitive conversational abilities, surpassing several peer models in task complexity and language adaptability.

Long-Context Capabilities

A critical outcome is Breeze-7B-32k's ability to process extended contexts without performance degradation. The passkey retrieval task illustrated its proficiency with long sequences, showing the operational context effectively extended to 32k tokens without loss of accuracy (Figure 2).

Figure 2: Passkey Retrieval results of Breeze-7B-Base and Breeze-7B-32k-Base. The y-axis denotes the input sequence length, while the x-axis denotes the depth of the key position in the example. Each length-depth combination is trialed 20 times and the accuracy is color-coded with the colormap at the bottom.
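A passkey-retrieval test case of the kind summarized in Figure 2 can be built by burying a random key at a chosen relative depth inside filler text and asking the model to repeat it. The filler sentence and prompt wording below are illustrative assumptions.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is bright. "

def make_passkey_prompt(target_len_chars: int, depth: float, seed: int = 0):
    """Build one test case; depth=0.0 puts the key at the start of the
    filler, depth=1.0 at the end. Returns (prompt, expected passkey)."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    needle = f"The passkey is {passkey}. Remember it. "
    n_repeats = max(1, target_len_chars // len(FILLER))
    insert_at = int(n_repeats * depth)
    chunks = [FILLER] * n_repeats
    chunks.insert(insert_at, needle)
    prompt = "".join(chunks) + "\nWhat is the passkey?"
    return prompt, passkey

prompt, key = make_passkey_prompt(2000, depth=0.5)
assert key in prompt
```

Sweeping over length-depth combinations, as in Figure 2, reveals whether retrieval accuracy degrades at particular positions, e.g. keys buried in the middle of very long inputs.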

Conclusion

Breeze-7B exemplifies the progress of LLMs tailored for Traditional Chinese, delivering marked gains in language comprehension and interactive capability. Its open-source release invites further innovation from the community and sustains ongoing work on Traditional Chinese language technology. Such developments not only promise practical gains in multilingual settings but also chart a path for improving non-English LLMs more broadly.
