TinyLlama: An Open-Source Small Language Model (2401.02385v2)
Published 4 Jan 2024 in cs.CL and cs.AI
Abstract: We present TinyLlama, a compact 1.1B LLM pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention and Lit-GPT), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source LLMs with comparable sizes. Our model checkpoints and code are publicly available on GitHub at https://github.com/jzhang38/TinyLlama.
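Since TinyLlama reuses the Llama 2 architecture and tokenizer, it can be loaded with standard Llama-compatible tooling. Below is a minimal usage sketch with the Hugging Face transformers library; the model ID "TinyLlama/TinyLlama-1.1B-Chat-v1.0" and the generation settings are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: loading and prompting TinyLlama with Hugging Face transformers.
# ASSUMPTION: the checkpoint ID below is illustrative; substitute the released
# TinyLlama checkpoint you actually want to use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
device = "cuda" if torch.cuda.is_available() else "cpu"

# TinyLlama keeps the Llama 2 tokenizer, so AutoTokenizer resolves to the
# standard Llama (SentencePiece BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

prompt = "Explain in two sentences why small language models matter."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding keeps the example deterministic; enable sampling as needed.
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```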
References:
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP.
- PaLM 2 technical report.
- Qwen technical report.
- Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of ICML.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of AAAI.
- Language models are few-shot learners. In Proceedings of NeurIPS.
- INSTRUCTEVAL: Towards holistic evaluation of instruction-tuned large language models. CoRR, abs/2306.04757.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL.
- A framework for few-shot language model evaluation.
- Measuring massive multitask language understanding. In Proceedings of ICLR.
- Training compute-optimal large language models. In Proceedings of NeurIPS.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- xFormers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers.
- StarCoder: May the source be with you! Transactions on Machine Learning Research.
- Decoupled weight decay regularization. In Proceedings of ICLR.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of EMNLP.
- Scaling data-constrained language models. In Proceedings of NeurIPS.
- OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Shazeer, N. (2020). GLU variants improve transformer. CoRR, abs/2002.05202.
- SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
- Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of ACL.
- Thaddée, Y. T. (2023). Chinchilla’s death. https://espadrine.github.io/blog/posts/chinchilla-s-death.html.
- Together Computer (2023). RedPajama: An open dataset for training large language models.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Attention is all you need. In Proceedings of NeurIPS.
- Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of NeurIPS.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the ACL.
- Root mean square layer normalization. In Proceedings of NeurIPS.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- CodeGeeX: A pre-trained model for code generation with multilingual benchmarking on HumanEval-X. In Proceedings of KDD.