TeenyTinyLlama: open-source tiny language models trained in Brazilian Portuguese (2401.16640v3)

Published 30 Jan 2024 in cs.CL and cs.LG

Abstract: LLMs have significantly advanced natural language processing, but their progress has yet to be equal across languages. While most LLMs are trained in high-resource languages like English, multilingual models generally underperform monolingual ones. Additionally, aspects of their multilingual foundation sometimes restrict the byproducts they produce, like computational demands and licensing regimes. In this study, we document the development of open-foundation models tailored for use in low-resource settings, their limitations, and their benefits. This is the TeenyTinyLlama pair: two compact models for Brazilian Portuguese text generation. We release them under the permissive Apache 2.0 license on GitHub and Hugging Face for community use and further development. See https://github.com/Nkluge-correa/TeenyTinyLlama
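
Since the models are released on Hugging Face, they can presumably be loaded with the standard transformers API. The sketch below is illustrative only: the repository ID `nicholasKluge/TeenyTinyLlama-160m` is an assumption based on the project's naming and should be checked against the GitHub repository linked above before use.

```python
# Minimal sketch of loading one of the TeenyTinyLlama checkpoints from Hugging Face.
# NOTE: the repository ID below is an assumed/illustrative identifier; confirm the
# official model names at https://github.com/Nkluge-correa/TeenyTinyLlama.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nicholasKluge/TeenyTinyLlama-160m"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short continuation for a Brazilian Portuguese prompt.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```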
