TinyLlama: An Open-Source Small Language Model (2401.02385v2)

Published 4 Jan 2024 in cs.CL and cs.AI

Abstract: We present TinyLlama, a compact 1.1B LLM pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source LLMs with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.

Overview of TinyLlama

TinyLlama is a recently released LLM pre-trained on roughly 1 trillion tokens. Despite its modest size of 1.1B parameters, it outperforms other open-source models of similar scale across a series of benchmarks. By integrating innovations from the open-source community, such as FlashAttention, it also achieves strong computational efficiency.
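The released checkpoints can be tried directly from Python. The sketch below loads a TinyLlama checkpoint through Hugging Face Transformers; the repository id is an assumption based on the publicly hosted checkpoints rather than something stated in the paper, so substitute whichever checkpoint you actually use.

```python
# Minimal sketch: loading a TinyLlama checkpoint for text generation.
# The repo id below is an assumption (one of the publicly hosted TinyLlama
# checkpoints), not a name taken from the paper itself.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer("TinyLlama is a 1.1B parameter model that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```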

Pre-training Methodology

TinyLlama's pre-training corpus combines natural language data from the SlimPajama corpus with code data from Starcoderdata, mixed at a natural-language-to-code ratio of roughly 7:3 and trained for about three epochs using Llama 2's tokenizer. The architecture mirrors Llama 2, employing Rotary Positional Embedding, RMSNorm, SwiGLU activations, and grouped-query attention. Training additionally relies on Fully Sharded Data Parallel and FlashAttention to increase throughput while reducing GPU memory requirements.
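For readers less familiar with these components, the following PyTorch sketch shows how RMSNorm, SwiGLU, and grouped-query attention fit together in a Llama-2-style pre-norm decoder block. The dimensions are illustrative placeholders, not TinyLlama's actual hyperparameters, and rotary position embeddings are only indicated by a comment to keep the sketch short.

```python
# Illustrative sketch of the Llama-2-style components TinyLlama reuses:
# RMSNorm, a SwiGLU feed-forward block, and grouped-query attention (GQA).
# All sizes are placeholders, not TinyLlama's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer norm (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLU(nn.Module):
    """Gated feed-forward block: silu(x W_gate) * (x W_up), then W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class GroupedQueryAttention(nn.Module):
    """Attention where several query heads share one key/value head."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each key/value head serves n_heads // n_kv_heads query heads.
        repeat = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        # (Rotary position embeddings would be applied to q and k here.)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class DecoderBlock(nn.Module):
    """Pre-norm block: x + attn(norm(x)), then x + mlp(norm(x))."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, ffn_dim=1408):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.mlp_norm = RMSNorm(dim)
        self.mlp = SwiGLU(dim, ffn_dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        return x + self.mlp(self.mlp_norm(x))


if __name__ == "__main__":
    block = DecoderBlock()
    print(block(torch.randn(1, 16, 512)).shape)  # torch.Size([1, 16, 512])
```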

Performance Analysis

TinyLlama was evaluated on a range of commonsense reasoning and problem-solving tasks. It outperforms open-source LLMs of comparable size, such as OPT-1.3B and the Pythia variants, on many of these tasks, and its scores improve steadily as more pre-training compute is spent. A notable jump in performance followed a fix to the data preprocessing pipeline, which had previously inserted excess end-of-sequence tokens into the training data.
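Benchmarks such as HellaSwag, PIQA, and ARC are typically scored zero-shot by asking which candidate answer the model assigns the highest log-likelihood. The paper relies on a standard evaluation harness for this; the sketch below is only a simplified, hand-rolled illustration of that scoring idea, using an assumed checkpoint id and a made-up toy item.

```python
# Simplified illustration of zero-shot multiple-choice scoring: pick the
# answer continuation with the highest log-likelihood under the model.
# This is NOT the paper's evaluation harness, just a sketch of the idea;
# the model id is an assumed public TinyLlama checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()


def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` after `context`.

    Simplification: assumes the context tokenization is a prefix of the full
    tokenization, which generally holds for space-separated continuations.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probs of each token given the previous tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # Keep only the continuation part (tokens after the context).
    return token_lp[ctx_ids.shape[1] - 1:].sum().item()


# Toy multiple-choice item (made up for illustration only).
question = "To keep ice cream from melting on a hot day, you should"
choices = [" put it in a freezer.", " leave it in the sun."]
scores = [continuation_logprob(question, c) for c in choices]
print("Predicted choice:", choices[scores.index(max(scores))])
```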

Conclusion and Contributions

Beyond building an efficient LLM, the research team has committed to transparency and accessibility by releasing TinyLlama's code and checkpoints publicly. Its strong performance at a comparatively small size makes it a valuable resource for both researchers and developers. The authors plan to build on these experiences to further refine TinyLlama, expanding its capabilities and practical applications, with updated versions to be documented in future releases. The work was made possible by substantial support from the academic and funding communities.

References (38)
  1. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP.
  2. PaLM 2 technical report.
  3. Qwen technical report.
  4. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of ICML.
  5. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of AAAI.
  6. Language models are few-shot learners. In Proceedings of NeurIPS.
  7. INSTRUCTEVAL: towards holistic evaluation of instruction-tuned large language models. CoRR, abs/2306.04757.
  8. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  9. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL.
  10. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  11. Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  12. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL.
  13. A framework for few-shot language model evaluation.
  14. Measuring massive multitask language understanding. In Proceedings of ICLR.
  15. Training compute-optimal large language models. In Proceedings of NeurIPS.
  16. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  17. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers.
  18. StarCoder: May the source be with you! Transactions on Machine Learning Research.
  19. Decoupled weight decay regularization. In Proceedings of ICLR.
  20. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of EMNLP.
  21. Scaling data-constrained language models. In Proceedings of NeurIPS.
  22. OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  23. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
  24. Shazeer, N. (2020). GLU variants improve transformer. CoRR, abs/2002.05202.
  25. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
  26. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  27. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  28. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of ACL.
  29. Thaddée, Y. T. (2023). Chinchilla’s death. https://espadrine.github.io/blog/posts/chinchilla-s-death.html.
  30. Together Computer (2023). RedPajama: An open dataset for training large language models.
  31. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  33. Attention is all you need. In Proceedings of NeurIPS.
  34. Chain of thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS.
  35. HellaSwag: Can a machine really finish your sentence? In Proceedings of the ACL.
  36. Root mean square layer normalization. In Proceedings of NeurIPS.
  37. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  38. CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023), pages 5673–5684. ACM.
Authors (4)
  1. Peiyuan Zhang (24 papers)
  2. Guangtao Zeng (14 papers)
  3. Tianduo Wang (5 papers)
  4. Wei Lu (325 papers)
Citations (255)