The Falcon Series of Open Language Models (2311.16867v2)

Published 28 Nov 2023 in cs.CL and cs.AI

Abstract: We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse, high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best LLMs in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, which allowed us to efficiently pretrain these models on up to 4,096 A100s on AWS cloud infrastructure with limited interconnect. We release a 600B-token extract of our web dataset, as well as the Falcon-7/40/180B models, under a permissive license to foster open science and accelerate the development of an open ecosystem of LLMs.

The Falcon series, introduced by the Technology Innovation Institute, comprises three causal decoder-only models, Falcon-7B, Falcon-40B, and Falcon-180B, trained at increasing scales of data and compute. The largest, Falcon-180B, was trained on roughly 3.5 trillion tokens of text, described as the largest openly documented pretraining run. The models are presented as significant contributions to open LLMs: the 180B variant is released under a responsible-AI license, while the smaller models are released under the Apache 2.0 license.

The research leading to the Falcon models involved extensive experimentation to settle the architecture and pretraining data. The team relied heavily on high-quality web data, carefully filtered and deduplicated, challenging the belief that curated datasets are necessary for training strong LLMs. This also led to the decision not to repeat data during training, avoiding issues with memorization and performance degradation. For the architecture, the team adopted a variant of multiquery attention, termed multigroup attention, to improve inference efficiency by shrinking the key-value (KV) cache that must be kept in memory during decoding.
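To make the KV-cache argument concrete, the sketch below implements grouped ("multigroup") attention on top of PyTorch's scaled_dot_product_attention: a small number of key/value heads is shared across groups of query heads, so the cache stored per token shrinks by the grouping factor. The head counts and tensor shapes are illustrative assumptions, not Falcon's actual configuration.

```python
# Minimal multigroup (grouped-query) attention sketch; shapes are illustrative.
import torch
import torch.nn.functional as F

def multigroup_attention(q, k, v):
    # q: [batch, n_query_heads, seq, head_dim]
    # k, v: [batch, n_kv_heads, seq, head_dim] with n_kv_heads << n_query_heads
    group = q.shape[1] // k.shape[1]
    # Each KV head is shared by `group` query heads; only k and v need caching,
    # so the KV cache is `group` times smaller than with full multi-head attention.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

batch, seq, head_dim = 2, 16, 64
n_query_heads, n_kv_heads = 8, 2           # hypothetical head counts
q = torch.randn(batch, n_query_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

out = multigroup_attention(q, k, v)
print(out.shape)                           # torch.Size([2, 8, 16, 64])
print("KV cache reduction factor:", n_query_heads // n_kv_heads)
```

Setting the number of KV heads to 1 recovers plain multiquery attention; multigroup attention sits between that extreme and full multi-head attention.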

Implementation-wise, the Falcon models were trained on AWS cloud infrastructure using cost-efficient hardware (A100 40GB GPUs, up to 4,096 of them for Falcon-180B) with limited interconnect. This is enabled by a custom distributed training framework, Gigatron, which combines 3D parallelism with ZeRO optimizer-state sharding to balance memory use and computational efficiency. FlashAttention kernels are used to further speed up training.
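As a rough illustration of why optimizer-state sharding matters on 40GB cards, the back-of-the-envelope sketch below estimates per-GPU training state for a 40B-parameter model with and without ZeRO-style sharding of the Adam state across the data-parallel group. The byte counts (bf16 weights and gradients, fp32 master weights and Adam moments) are common conventions used here as assumptions, not figures reported in the paper.

```python
# Back-of-the-envelope estimate of per-GPU training state (illustrative only).
def per_gpu_state_gib(n_params: float, dp_degree: int, shard_optimizer: bool) -> float:
    weights_and_grads = n_params * (2 + 2)        # bf16 weights + bf16 grads, replicated
    optimizer_state = n_params * (4 + 4 + 4)      # fp32 master weights + Adam m and v
    if shard_optimizer:                           # ZeRO stage-1 style sharding
        optimizer_state /= dp_degree
    return (weights_and_grads + optimizer_state) / 2**30

n_params = 40e9                                   # Falcon-40B scale
for dp in (1, 8, 64):
    sharded = per_gpu_state_gib(n_params, dp, shard_optimizer=True)
    replicated = per_gpu_state_gib(n_params, dp, shard_optimizer=False)
    print(f"dp={dp:>3}: {sharded:6.1f} GiB sharded vs {replicated:6.1f} GiB replicated")
```

Even with the optimizer state fully sharded, the weights and gradients alone exceed a single 40GB GPU at this scale, which is why tensor and pipeline parallelism (the other two axes of 3D parallelism) are combined with ZeRO sharding.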

Upon evaluation, Falcon-180B demonstrates competitive performance across a range of natural language processing tasks, placing it alongside top LLMs such as OpenAI's GPT series and Google's PaLM models. In evaluations run with the EleutherAI LM Evaluation Harness, the Falcon models not only show strong results on standard NLP benchmarks but also demonstrate potential for specialization in areas such as chatbot development and code-related tasks.
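For readers who want to reproduce this style of benchmark, the sketch below shows the log-likelihood scoring that harnesses such as the EleutherAI LM Evaluation Harness apply to multiple-choice tasks: each candidate continuation is scored by the sum of token log-probabilities the model assigns to it, and the highest-scoring choice is the prediction. The toy question and the use of the released tiiuae/falcon-7b checkpoint are illustrative assumptions, not the paper's evaluation setup.

```python
# Minimal multiple-choice scoring sketch in the style of common eval harnesses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"                    # openly released Falcon checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.to(model.device)
    full_ids = tok(context + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    cont_ids = full_ids[0, ctx_ids.shape[1]:]
    # Logits at position t predict the token at position t + 1.
    scores = logprobs[0, ctx_ids.shape[1] - 1 : -1].gather(-1, cont_ids[:, None])
    return scores.sum().item()

context = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " Berlin", " Madrid"]
print(max(choices, key=lambda c: continuation_logprob(context, c)))
```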

The authors acknowledge limitations in their research, including that architectural and data choices validated at smaller scales may not carry over to larger ones, and that training compute may need to be decoupled from inference compute to manage downstream deployment costs. Moreover, since the Falcon models are predominantly trained on English web data, they may perform poorly on other languages and out-of-scope domains.

The release of the Falcon models and a 600B-token portion of the RefinedWeb dataset under open licenses represents a push toward democratizing AI research, fostering collaboration, and encouraging responsible use of LLMs. The models and accompanying research documentation have been made publicly available with the intention of contributing to collective advancement in AI.
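As a pointer to the released artifacts, the snippet below streams a few documents from the public RefinedWeb extract on the Hugging Face Hub. The dataset identifier tiiuae/falcon-refinedweb refers to the released 600B-token extract; the column name used below is an assumption based on the published dataset card, and streaming is used simply to avoid downloading the full corpus.

```python
# Stream a few documents from the released RefinedWeb extract (usage sketch).
from datasets import load_dataset

ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
for i, doc in enumerate(ds):
    # "content" is assumed to hold the document text; check the dataset card if the schema differs.
    print(doc["content"][:200].replace("\n", " "))
    if i == 2:
        break
```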

Authors (14)
  1. Ebtesam Almazrouei (7 papers)
  2. Hamza Alobeidli (3 papers)
  3. Abdulaziz Alshamsi (1 paper)
  4. Alessandro Cappelli (10 papers)
  5. Ruxandra Cojocaru (6 papers)
  6. Mérouane Debbah (634 papers)
  7. Étienne Goffinet (4 papers)
  8. Daniel Hesslow (12 papers)
  9. Julien Launay (17 papers)
  10. Quentin Malartic (6 papers)
  11. Daniele Mazzotta (1 paper)
  12. Badreddine Noune (3 papers)
  13. Baptiste Pannier (4 papers)
  14. Guilherme Penedo (7 papers)