
Nyonic Technical Report

Published 24 Apr 2024 in cs.CL (arXiv:2404.15702v1)

Abstract: This report details the development and key achievements of our latest LLM designed for custom LLMs. The advancements introduced include a novel Online Data Scheduler that supports flexible training data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer to enhance stability and performance. Moreover, our robust training framework incorporates advanced monitoring and rapid recovery features to ensure optimal efficiency. Our Wonton 7B model has demonstrated competitive performance on a range of multilingual and English benchmarks. Future developments will prioritize narrowing the performance gap with more extensively trained models, thereby enhancing the model's real-world efficacy and adaptability. GitHub: https://github.com/nyonicai/nyonic-public


Summary

  • The paper introduces the Wonton 7B model with a novel Online Data Scheduler that dynamically adjusts training data for efficient curriculum learning.
  • It employs advanced tokenization with a tailored multilingual vocabulary and transformer enhancements like Rotary Positional Embeddings and QK-LayerNorm to improve model stability.
  • Robust experimental results highlight its competitive performance on multilingual benchmarks and scalable deployment using cutting-edge infrastructure.

Development and Evaluation of a Novel LLM Architecture: The Wonton 7B

Introduction

In this report, the development of the Wonton 7B model is detailed, highlighting significant improvements in the areas of data scheduling, tokenization, model architecture, and deployment strategies. This model integrates advanced components like an Online Data Scheduler and utilizes cutting-edge techniques such as Rotary Positional Embeddings and QK-LayerNorm. The model's performance is benchmarked on a variety of tasks, demonstrating its efficacy in multilingual and English contexts.

Data Scheduling Innovations

The Wonton 7B leverages a novel Online Data Scheduler to dynamically adjust training data, supporting an efficient and flexible training process. Key benefits of this scheduler include:

  • Real-time adjustments to training data ratios based on immediate model feedback.
  • Curriculum learning capabilities that focus training efforts on more challenging or relevant data, optimizing computational resources.
  • Efficient data loading and processing achieved through an integrated multiplexer and content stuffing approach, allowing seamless mixtures of data from various sources.
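The scheduler described above can be sketched as a small multiplexer that samples from several data sources according to adjustable mixture weights. This is a minimal illustration, not the paper's implementation: the class name, seeding, and the toy single-example sources are assumptions for demonstration; the key idea shown is that mixture weights can be updated mid-training without restarting the data pipeline.

```python
import random
from itertools import cycle

class OnlineDataScheduler:
    """Minimal sketch of a data multiplexer with adjustable mixture weights.

    Each source is treated as an endless iterator of training examples;
    `weights` gives the sampling probability of each source and can be
    updated mid-training to implement curriculum learning.
    """

    def __init__(self, sources, weights, seed=0):
        assert len(sources) == len(weights)
        self.sources = [cycle(s) for s in sources]
        self.indices = list(range(len(sources)))
        self.set_weights(weights)
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        total = sum(weights)
        self.weights = [w / total for w in weights]  # normalize to a distribution

    def __iter__(self):
        return self

    def __next__(self):
        # Sample a source according to the current mixture, then draw one example.
        idx = self.rng.choices(self.indices, weights=self.weights, k=1)[0]
        return next(self.sources[idx])

# Usage: start web-heavy, then shift the mixture toward code mid-run.
sched = OnlineDataScheduler(
    sources=[["web_doc"], ["code_doc"]],   # toy single-example sources
    weights=[0.9, 0.1],
)
batch1 = [next(sched) for _ in range(8)]
sched.set_weights([0.2, 0.8])              # curriculum adjustment, no restart needed
batch2 = [next(sched) for _ in range(8)]
```

A production scheduler would additionally pack ("stuff") variable-length documents into fixed-length training sequences, which this sketch omits.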

Advanced Tokenization Techniques

The model employs a multilingual tokenizer using byte-pair encoding (BPE) with a tailored vocabulary of 139,000 tokens. This tokenizer:

  • Efficiently handles diverse data sources including code and multilingual text.
  • Optimizes model performance through a well-tuned vocabulary size, balancing the computational cost of larger embedding and output layers against tokenization efficiency.

Model Architecture and Training

Wonton 7B builds upon the transformer architecture, benefiting from:

  • Rotary Positional Embeddings (RoPE) which offer a nuanced approach to incorporating sequence position information.
  • QK-LayerNorm applied to queries and keys before the attention dot product, enhancing training stability by keeping attention logits well-scaled.
  • Max-z Loss supplementation to maintain controlled logit values during training, promoting robust and stable learning outcomes.
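The three components above can be sketched compactly. This is an illustrative implementation under simplifying assumptions, not the paper's code: the RoPE base of 10000, the omission of learned LayerNorm scale/shift parameters, and the max-z coefficient of 2e-4 are conventional choices borrowed from the literature and may differ from the actual configuration.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Channel pairs are rotated by a position-dependent angle, so dot
    products between rotated vectors depend only on relative position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def qk_layernorm(x, eps=1e-5):
    """LayerNorm (learned scale/shift omitted) applied to queries and keys
    before the attention dot product, bounding attention logit magnitudes."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def max_z_loss(logits, coeff=2e-4):
    """Auxiliary penalty on the largest logit per position, discouraging
    logit drift during training (coefficient is an assumed value)."""
    z = np.max(logits, axis=-1)
    return coeff * np.mean(z ** 2)

q = np.random.default_rng(0).normal(size=(4, 8))
q_rot = rope(qk_layernorm(q))  # normalize queries, then rotate by position
```

Note that RoPE, being a pure rotation of channel pairs, leaves vector norms unchanged, which is what keeps the normalization applied by QK-LayerNorm intact.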

The model's training utilized the AdamW optimizer with specific attention to learning rate adjustments and weight decay settings conducive to optimal performance.
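The distinguishing feature of AdamW, decoupled weight decay, can be shown in a single scalar update step. The hyperparameter values below (learning rate, betas, decay) are illustrative assumptions, not the paper's reported settings.

```python
def adamw_step(param, grad, state, lr=3e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter.

    Unlike L2 regularization folded into the gradient, the decay term is
    applied directly to the weight, independent of the adaptive moments."""
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)       # adaptive step
    param = param - lr * weight_decay * param               # decoupled decay
    return param, state

state = {"t": 0, "m": 0.0, "v": 0.0}
p = 1.0
p, state = adamw_step(p, grad=0.5, state=state)
```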

Infrastructure and Deployment

Utilizing the combined strengths of PyTorch, DeepSpeed, and NVIDIA's technologies like FlashAttention and TensorRT, the Wonton 7B model achieves robust training throughput and inference efficiency. Deployment on Alibaba's Aliyun EAS ensures scalable and secure model serving capabilities.

Experimental Validation

The model was rigorously tested against benchmarks like Lambada, WinoGrande, and various multilingual tasks from XNLI and Belebele. Wonton 7B displayed competitive performance across these, suggesting effective learning strategies and architecture choices, though still trailing behind more extensively trained models like Mistral 7B in some complex reasoning tasks.

Conclusion and Future Work

The Wonton 7B represents a thoughtful integration of novel AI techniques and infrastructure decisions, yielding a high-performance model with strong multilingual capabilities. Future work will focus on closing the performance gap identified in specialized tasks and extending the model’s application range to additional languages and domains. The current findings and model assets are made available for community use and further development, promising a continued enhancement of LLM capabilities.
