Nyonic Technical Report
Abstract: This report details the development and key achievements of our latest large language model (LLM), designed to serve as a foundation for building custom LLMs. The advancements introduced include a novel Online Data Scheduler that supports flexible training-data adjustments and curriculum learning. The model's architecture is fortified with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer to enhance stability and performance. Moreover, our robust training framework incorporates advanced monitoring and rapid-recovery features to ensure optimal efficiency. Our Wonton 7B model has demonstrated competitive performance on a range of multilingual and English benchmarks. Future developments will prioritize narrowing the performance gap with more extensively trained models, thereby enhancing the model's real-world efficacy and adaptability.

GitHub: https://github.com/nyonicai/nyonic-public
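The abstract names an Online Data Scheduler that can adjust the training mixture on the fly; the report's actual design is not reproduced here. As a purely illustrative sketch, assuming such a scheduler reduces to weighted sampling over named corpora whose weights can be updated mid-training (one common way to realize curriculum learning), all class and method names below are hypothetical:

```python
# Hypothetical sketch of an online data scheduler: mixes several data
# sources by sampling weights that can be changed while training runs.
# This is NOT the report's implementation; names are illustrative.
import itertools
import random
from typing import Dict, Iterable, Iterator


class OnlineDataScheduler:
    """Samples training examples from named sources by adjustable weights."""

    def __init__(self, sources: Dict[str, Iterable[str]], weights: Dict[str, float]):
        # Cycle each source so sampling never exhausts a stream.
        self._iters: Dict[str, Iterator[str]] = {
            name: itertools.cycle(data) for name, data in sources.items()
        }
        self.weights = dict(weights)

    def update_weights(self, new_weights: Dict[str, float]) -> None:
        """Adjust the mixture online, e.g. to follow a curriculum schedule."""
        self.weights.update(new_weights)

    def __iter__(self) -> Iterator[str]:
        names = list(self._iters)
        while True:
            # Re-read weights each step so updates take effect immediately.
            probs = [self.weights[n] for n in names]
            name = random.choices(names, weights=probs, k=1)[0]
            yield next(self._iters[name])


# Usage: start web-heavy, then shift the mixture toward code mid-stream.
sched = OnlineDataScheduler(
    {"web": ["w1", "w2"], "code": ["c1", "c2"]},
    {"web": 0.9, "code": 0.1},
)
stream = iter(sched)
print([next(stream) for _ in range(4)])
sched.update_weights({"web": 0.3, "code": 0.7})
print([next(stream) for _ in range(4)])
```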
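Rotary Positional Embeddings (RoPE) and QK-LayerNorm are both published techniques: RoPE rotates query/key channel pairs by position-dependent angles to encode relative positions, and QK-LayerNorm normalizes queries and keys per head before the attention dot product, keeping logits bounded and stabilizing training. The minimal PyTorch sketch below shows how the two slot into a self-attention layer; the dimensions and module structure are illustrative, not Wonton 7B's actual configuration:

```python
# Minimal sketch (PyTorch >= 2.0) of self-attention with RoPE and
# QK-LayerNorm. Hyperparameters here are illustrative, not the model's.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (batch, heads, time, dim) by position-dependent angles."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # each (t, half)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class QKNormAttention(nn.Module):
    """Causal multi-head self-attention with RoPE and QK-LayerNorm."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # Normalizing Q and K per head bounds the attention logits,
        # mitigating the entropy collapse seen in large-scale training.
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.d_head)
        q = self.q_norm(q.reshape(shape)).transpose(1, 2)
        k = self.k_norm(k.reshape(shape)).transpose(1, 2)
        v = v.reshape(shape).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)        # inject relative positions
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, t, -1))


attn = QKNormAttention(d_model=512, n_heads=8)
print(attn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```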