Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation (2410.18565v1)

Published 24 Oct 2024 in cs.CL and cs.AI

Abstract: We introduce Bielik 7B v0.1, a 7-billion-parameter generative text model for Polish language processing. Trained on curated Polish corpora, this model addresses key challenges in LLM development through innovative techniques. These include Weighted Instruction Cross-Entropy Loss, which balances the learning of different instruction types, and Adaptive Learning Rate, which dynamically adjusts the learning rate based on training progress. To evaluate performance, we created the Open PL LLM Leaderboard and Polish MT-Bench, novel frameworks assessing various NLP tasks and conversational abilities. Bielik 7B v0.1 demonstrates significant improvements, achieving a 9 percentage point increase in average score compared to Mistral-7B-v0.1 on the RAG Reader task. It also excels in the Polish MT-Bench, particularly in Reasoning (6.15/10) and Role-playing (7.83/10) categories. This model represents a substantial advancement in Polish language AI, offering a powerful tool for diverse linguistic applications and setting new benchmarks in the field.

Summary

  • The paper introduces Bielik 7B v0.1, a 7-billion-parameter Polish language model trained with a Weighted Instruction Cross-Entropy Loss and an Adaptive Learning Rate for balanced, efficient training.
  • It demonstrates strong performance on Polish NLP benchmarks, outperforming its base model, Mistral-7B-v0.1, by 9 percentage points in average score on the RAG Reader task.
  • The research sets a precedent for future AI in underrepresented languages while discussing tokenization choices and ethical considerations.

Overview of Bielik 7B v0.1: A Polish LLM

The development of large-scale LLMs has traditionally focused on richly resourced languages such as English, marginalizing those with fewer digital resources. The paper "Bielik 7B v0.1: A Polish Language Model -- Development, Insights, and Evaluation" addresses this gap by introducing a generative text model designed specifically for Polish. This 7-billion-parameter model combines dedicated training techniques with a curated Polish corpus, demonstrates notable performance on Polish NLP tasks, and establishes evaluation frameworks for future non-English AI research.

Development and Techniques

Bielik 7B v0.1 builds upon the foundation of the Mistral 7B v0.1 model. The authors highlight several critical innovations:

  • Weighted Instruction Cross-Entropy Loss and Adaptive Learning Rate are used to optimize training: the former assigns weights to training instructions so that different instruction types are learned in a balanced way, while the latter adjusts the learning rate dynamically as training progresses (a sketch of the weighted loss follows this list).
  • A diverse dataset, primarily composed of Polish texts, was curated. This dataset underwent rigorous preprocessing and quality evaluation, yielding a robust training corpus of 22 billion tokens (supplemented with 14 billion English tokens).
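
The paper's summary describes these two techniques only at a high level, so the following is a minimal PyTorch sketch of a per-example weighted instruction cross-entropy loss, assuming each instruction carries a scalar weight and that prompt tokens are masked out of the loss. The function name, tensor layout, and weighting scheme are illustrative rather than taken from the authors' code, and the adaptive learning-rate schedule is not reproduced here.

```python
import torch
import torch.nn.functional as F


def weighted_instruction_ce_loss(
    logits: torch.Tensor,           # (batch, seq_len, vocab)
    labels: torch.Tensor,           # (batch, seq_len); prompt/pad tokens set to ignore_index
    example_weights: torch.Tensor,  # (batch,); per-instruction weight (hypothetical)
    ignore_index: int = -100,
) -> torch.Tensor:
    """Cross-entropy over response tokens, weighted per training example."""
    batch, seq_len, vocab = logits.shape
    # Per-token cross-entropy; ignored (prompt/padding) positions contribute 0.
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab),
        labels.reshape(-1),
        ignore_index=ignore_index,
        reduction="none",
    ).reshape(batch, seq_len)
    # Average over the unmasked (response) tokens of each example.
    mask = (labels != ignore_index).float()
    per_example = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    # Weight each example's contribution before averaging over the batch.
    return (example_weights * per_example).sum() / example_weights.sum().clamp(min=1e-8)
```

In practice, the example weights could encode instruction type or quality, which is one plausible way to realize the "balanced learning of different instruction types" described in the abstract.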

Model Architecture

The architecture of Bielik 7B v0.1 follows a Transformer-based design with notable configurations such as 32 layers and 32 attention heads. The model integrates advanced features like Rotary Positional Embeddings, SwiGLU activation, and Root Mean Square Layer Normalization, which collectively enhance performance on Polish language processing.
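
For concreteness, here is a minimal PyTorch sketch of two of the components named above, RMSNorm and a SwiGLU feed-forward block, as they commonly appear in Mistral-style decoders; module names and shapes are illustrative and not drawn from the Bielik codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization: rescales by the RMS, no mean-centering or bias."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated linear unit followed by a down-projection."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```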

Tokenization

The model relies on the Mistral 7B tokenizer, which the authors attempted to expand and refine to better suit Polish syntax and morphology. Tokenization efficiency was evaluated with metrics such as tokens per word and characters per token, and the authors acknowledge remaining issues with incorrect token combinations.
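
As an illustration of these two metrics, the short sketch below computes them with the Hugging Face tokenizers API; the repository id is an assumption about where the released base checkpoint is published, and whitespace splitting is only a rough proxy for word counts.

```python
from transformers import AutoTokenizer

# Assumed repo id for the released base model; adjust if the checkpoint lives elsewhere.
tokenizer = AutoTokenizer.from_pretrained("speakleash/Bielik-7B-v0.1")


def tokenizer_stats(texts):
    """Compute tokens-per-word and characters-per-token over a list of texts."""
    n_tokens = n_words = n_chars = 0
    for text in texts:
        ids = tokenizer.encode(text, add_special_tokens=False)
        n_tokens += len(ids)
        n_words += len(text.split())  # rough whitespace word count
        n_chars += len(text)
    return {
        "tokens_per_word": n_tokens / max(n_words, 1),
        "chars_per_token": n_chars / max(n_tokens, 1),
    }


print(tokenizer_stats(["Bielik to polski model językowy oparty na Mistral 7B."]))
```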

Evaluation and Results

Performance evaluations were conducted through two primary frameworks: the Open PL LLM Leaderboard and the Polish MT-Bench.

  1. Open PL LLM Leaderboard: Bielik 7B v0.1 outperformed its base model (Mistral-7B-v0.1) on the RAG Reader task by 9 percentage points in average score, while posting competitive results across a range of NLP benchmarks (e.g., sentiment analysis, named entity recognition).
  2. Polish MT-Bench: In this conversational and instruction-following evaluation, Bielik 7B v0.1 scored particularly well in the Reasoning (6.15/10) and Role-playing (7.83/10) categories, reflecting solid conversational abilities; a brief usage sketch follows this list.
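
As a usage-level illustration of the conversational setting that Polish MT-Bench evaluates, here is a minimal generation sketch using Hugging Face transformers; the instruct-model repository id, the availability of a chat template, and the decoding settings are all assumptions rather than details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "speakleash/Bielik-7B-Instruct-v0.1"  # assumed repo id for the instruct checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "Briefly explain what photosynthesis is." (Polish)
messages = [{"role": "user", "content": "Wyjaśnij krótko, czym jest fotosynteza."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```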

Implications and Future Directions

Bielik 7B v0.1 marks a significant stride in Polish AI, providing a strong resource for Polish NLP and a template for other less-resourced languages. While the model sets new benchmarks, its development invites further exploration into:

  • Broader Linguistic Application: Extending techniques to other underrepresented languages.
  • Ethical Considerations: Approaches to mitigate bias and misinformation inherent to models trained on expansive web-crawled data.
  • Computational Efficiency: Quantization and calibration techniques help keep the model usable in resource-constrained environments, broadening its practical applications (see the sketch after this list).
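
As one generic example of running a 7B model under a tight memory budget, the sketch below loads the weights in 4-bit via bitsandbytes through transformers; this is not the paper's own quantization or calibration recipe, and the repository id is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Generic 4-bit loading via bitsandbytes; not the paper's quantization recipe.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "speakleash/Bielik-7B-Instruct-v0.1",  # assumed repo id
    quantization_config=quant_config,
    device_map="auto",
)
```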

Conclusion

This research underscores the potential of applying modern ML techniques to build capable LLMs in diverse linguistic contexts. While not revolutionary on a global scale, Bielik 7B v0.1 markedly advances Polish NLP and offers a foundational step toward more inclusive AI technologies. Future iterations could adopt more efficient tokenization and richer datasets to further broaden the scope and quality of the model's outputs.