Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation (2410.18565v1)
Abstract: We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in LLM development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.
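To make the two training-time techniques named in the abstract concrete, here is a minimal PyTorch sketch. The per-example weighting of the cross-entropy loss follows the abstract's description of Weighted Instruction Cross-Entropy Loss; the square-root learning-rate rule is a hypothetical reading of Adaptive Learning Rate, in the spirit of the batch-size analysis of Granziol et al. cited below. All function names, weight values, and the exact scaling rule are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def weighted_instruction_ce(logits, targets, example_weights, ignore_index=-100):
    """Cross-entropy where each example's loss is scaled by a per-example weight.

    logits:          (batch, seq_len, vocab) raw model outputs
    targets:         (batch, seq_len) token ids; prompt tokens set to ignore_index
    example_weights: (batch,) scalar weight per example (e.g. by instruction type)
    """
    batch, seq_len, vocab = logits.shape
    # Per-token loss without reduction, so it can be reweighted per example.
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab),
        targets.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).reshape(batch, seq_len)
    mask = (targets != ignore_index).float()
    # Average over each example's unmasked (response) tokens.
    per_example = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # Weighted mean across the batch: higher-weight instructions contribute more.
    return (per_example * example_weights).sum() / example_weights.sum().clamp(min=1e-8)

def adaptive_lr(base_lr, batch_tokens, reference_tokens):
    # Hypothetical rule: scale the learning rate with the square root of the
    # number of tokens actually present in the batch, relative to a reference
    # batch size. The paper's exact adjustment rule is not reproduced here.
    return base_lr * (batch_tokens / reference_tokens) ** 0.5

# Toy usage with random tensors.
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
targets[:, :3] = -100                      # mask the prompt span
weights = torch.tensor([1.0, 0.5])         # down-weight a lower-quality instruction
loss = weighted_instruction_ce(logits, targets, weights)
lr = adaptive_lr(3e-4, batch_tokens=4096, reference_tokens=8192)
```

The key design point in the loss is computing per-token cross-entropy with `reduction="none"`, so each example's mean response-token loss can be rescaled before the batch is aggregated.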
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, Singapore. Association for Computational Linguistics.
- Jason Ansel, Edward Yang, Horace He, et al. 2024. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. In 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24). ACM.
- Hicham Badri and Appu Shaji. 2023. Half-quadratic quantization of large machine learning models.
- Lucas Bandarkar, Davis Liang, Benjamin Muller, et al. 2024. The Belebele benchmark: A parallel reading comprehension dataset in 122 language variants. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 749–775, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
- Edward Beeching, Clémentine Fourrier, Nathan Habib, et al. 2023. Open LLM Leaderboard (2023-2024). https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. ArXiv, abs/2004.05150.
- Lichang Chen, Shiyang Li, Jun Yan, et al. 2023. AlpaGasus: Training a better Alpaca with fewer data.
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. ArXiv, abs/1904.10509.
- Sławomir Dadas. 2022. Training effective neural sentence encoders from automatically mined paraphrases. In 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 371–378.
- Sławomir Dadas, Michał Perełkiewicz, and Rafał Poświata. 2020. Evaluation of sentence representations in Polish. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1674–1680, Marseille, France. European Language Resources Association.
- Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning.
- Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In International Conference on Machine Learning.
- Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323.
- Leo Gao, Jonathan Tow, et al. 2023. A framework for few-shot language model evaluation.
- Diego Granziol, Stefan Zohren, and Stephen Roberts. 2022. Learning rates as a function of batch size: A random matrix theory approach to neural network training. J. Mach. Learn. Res., 23:173:1–173:65.
- Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert. 2023. MLX: Efficient and flexible machine learning on Apple silicon.
- Adam Ibrahim, Benjamin Thérien, Kshitij Gupta, et al. 2024. Simple and scalable strategies to continually pre-train large language models.
- Kenji Imamura and Eiichiro Sumita. 2022. Extending the subwording model of multilingual pretrained models for new languages.
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. 2023. Mistral 7B.
- Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. 2023. Pre-RMSNorm and Pre-CRMSNorm transformers: Equivalent and efficient Pre-LN transformers. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA. Curran Associates Inc.
- MT-Bench PL. https://huggingface.co/spaces/speakleash/mt-bench-pl.
- Gary King and Langche Zeng. 2001. Logistic regression in rare events data. Political Analysis, 9(2):137–163.
- Jan Kocoń, Piotr Miłkowski, and Monika Zaśko-Zielińska. 2019. Multi-level sentiment analysis of PolEmo 2.0: Extended corpus of multi-domain consumer reviews. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 980–991, Hong Kong, China. Association for Computational Linguistics.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.
- Overcoming catastrophic forgetting during domain adaptation of seq2seq language generation. In North American Chapter of the Association for Computational Linguistics.
- Ji Lin, Jiaming Tang, Haotian Tang, et al. 2024. AWQ: Activation-aware weight quantization for LLM compression and acceleration. In MLSys.
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
- Open dataset for development of Polish question answering systems. In Proceedings of the 6th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Wydawnictwo Poznanskie, Fundacja Uniwersytetu im. Adama Mickiewicza.
- Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. 2024. Orca-Math: Unlocking the potential of SLMs in grade school math.
- National Information Processing Institute and Gdańsk University of Technology. 2024. Qra models.
- Krzysztof Ociepa. 2023. ALLaMo: A simple, hackable, and fast framework for training medium-sized LLMs. https://github.com/chrisociepa/allamo.
- Krzysztof Ociepa and Azurro Team. 2024. Introducing APT3-1B-Base: Polish language model. Accessed: 2024-09-30.
- Maciej Ogrodniczuk and Mateusz Kopeć. 2014. The Polish Summaries Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, LREC 2014.
- StyloMetrix: An open-source multilingual tool for representing stylometric vectors.
- Oleksiy Ostapenko, Timothée Lesort, Pau Rodríguez, et al. 2022. Continual learning with foundation models: An empirical study of latent replay. In Proceedings of The 1st Conference on Lifelong Learning Agents, volume 199 of Proceedings of Machine Learning Research, pages 60–91. PMLR.
- Expert-annotated dataset to study cyberbullying in Polish language. Data, 9(1):1.
- Addressing imbalance in multi-label classification using weighted cross entropy loss function. 2020 27th National and 5th International Iranian Conference on Biomedical Engineering (ICBME), pages 333–338.
- Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik. 2020. KLEJ: Comprehensive benchmark for Polish language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1191–1201, Online. Association for Computational Linguistics.
- Piotr Rybak, Piotr Przybyła, and Maciej Ogrodniczuk. 2024. PolQA: Polish question answering dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12846–12855, Torino, Italia. ELRA and ICCL.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units.
- Noam Shazeer. 2020. GLU variants improve transformer.
- Zhengyan Shi, Adam X. Yang, Bin Wu, et al. 2024. Instruction tuning with loss over instructions.
- Daria Soboleva, Faisal Al-Khateeb, et al. 2023. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama.
- SpeakLeash Team. 2024. SpeakLeash a.k.a. Spichlerz! Accessed: 2024-09-30.
- Jianlin Su, Yu Lu, Shengfeng Pan, et al. 2024. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
- Teknium. 2023. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants. https://huggingface.co/datasets/teknium/OpenHermes-2.5.
- PoQuAD - the Polish Question Answering Dataset - description and analysis. In Proceedings of the 12th Knowledge Capture Conference 2023, pages 105–113.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is all you need. In Neural Information Processing Systems.
- Voicelab. 2023. Trurl 2 models.
- Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2023. OpenChat: Advancing open-source language models with mixed-quality data.
- Open PL LLM Leaderboard. https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard.
- Misclassification-guided loss under the weighted cross-entropy loss framework. Knowl. Inf. Syst., 66:4685–4720.
- Haoran Xu, Xianyuan Zhan, Honglei Yin, and Huiling Qin. 2022. Discriminator-weighted offline imitation learning from suboptimal demonstrations. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 24725–24742. PMLR.
- Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An open-source small language model.
- Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
- Chunting Zhou, Pengfei Liu, Puxin Xu, et al. 2023. LIMA: Less is more for alignment. In Advances in Neural Information Processing Systems, volume 36, pages 55006–55021. Curran Associates, Inc.