
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2211.05100v4)

Published 9 Nov 2022 in cs.CL

Abstract: LLMs have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access LLM designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer LLM that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

An Expert Overview of BLOOM: A Multilingual LLM

The paper "BLOOM: A 176B-Parameter Open-Access Multilingual LLM" presents a significant contribution to the field of NLP by documenting the development and evaluation of BLOOM. This model, a product of the BigScience collaborative effort, represents a monumental step in making large-scale LLMs accessible to the broader research community.

Overview and Motivation

BLOOM is a 176-billion-parameter multilingual LLM developed through the collaborative efforts of more than a thousand researchers from across the globe, coordinated under the BigScience initiative. Leveraging the compute resources provided by France’s Jean Zay supercomputer, BLOOM aims to democratize access to potentially transformative technologies that have typically been confined to well-resourced organizations.

Dataset and Tokenization

BLOOM's training dataset, ROOTS, is a curated collection of 498 datasets spanning 46 natural languages and 13 programming languages. The paper describes in depth the data governance and preprocessing strategies employed to ensure high-quality and diverse training data. The multilingual tokenizer, a byte-level BPE model and an important component in its own right, was designed to balance fertility (the average number of subword tokens produced per word) across languages, keeping tokenization efficient without systematically favoring some languages over others.
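
To make the fertility notion concrete, the short sketch below measures tokens per whitespace-delimited word with the publicly released bigscience/bloom-560m tokenizer; the sample sentences, and the use of whitespace splitting as a word proxy, are illustrative assumptions rather than the paper's evaluation protocol.

```python
# Minimal sketch of measuring tokenizer "fertility" (average subword tokens per
# whitespace-delimited word). Checkpoint and sample sentences are illustrative;
# languages written without spaces would need a proper word segmenter instead.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # small public BLOOM variant

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "French": "Le renard brun rapide saute par-dessus le chien paresseux.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
}

for lang, text in samples.items():
    fertility = len(tokenizer.tokenize(text)) / len(text.split())
    print(f"{lang}: {fertility:.2f} tokens per word")
```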

Model Architecture and Training

BLOOM's architecture is a causal decoder-only Transformer, which the authors found most suitable for zero-shot and few-shot generalization. Empirical ablations on smaller model variants guided architectural choices, including the adoption of ALiBi positional embeddings, which bias attention scores by key-query distance instead of adding positional embeddings to the input, and an additional layer normalization after the embedding layer. These design choices follow established practice but are tuned to balance training stability, performance, and multilingual competence.
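
As a rough illustration of the ALiBi mechanism (not BLOOM's training code), the sketch below builds the head-specific distance penalty that is added to the attention logits; the slope schedule assumes a power-of-two number of heads.

```python
# Minimal sketch of the ALiBi attention bias: no positional embeddings are added
# to the inputs; instead, each attention head adds a linear penalty proportional
# to key-query distance to its attention logits. Illustrative only.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... (valid for power-of-two head counts).
    base = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([base ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    rel = pos.view(1, -1) - pos.view(-1, 1)          # rel[i, j] = j - i (<= 0 for past keys)
    slopes = alibi_slopes(n_heads).view(-1, 1, 1)    # (heads, 1, 1)
    return slopes * rel                              # (heads, seq_len, seq_len)

# Usage: add to the raw attention scores before the causal mask and softmax, e.g.
# scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(n_heads, seq_len)
```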

Engineering and Training Infrastructure

The model was trained using the Megatron-DeepSpeed framework, which combines data, tensor, and pipeline parallelism with the ZeRO optimizer for efficient distributed training. Training ran for roughly 3.5 months on 384 NVIDIA A100 80GB GPUs of the Jean Zay supercomputer, sustaining high training throughput while keeping the environmental footprint comparatively modest; that footprint is documented in detail and compared with other large models.
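
The arithmetic below sketches how a 384-GPU job can be decomposed into data, tensor, and pipeline parallel groups; the particular degrees chosen are illustrative assumptions, not a claim about the exact layout used for BLOOM.

```python
# Illustrative arithmetic for a 3D-parallel layout over 384 GPUs (48 nodes x 8
# A100s). The specific parallel degrees below are assumptions for illustration.
WORLD_SIZE = 384

data_parallel = 8        # independent replicas processing different micro-batches
tensor_parallel = 4      # each layer's weight matrices sharded across 4 GPUs
pipeline_parallel = 12   # transformer layers split into 12 sequential stages

assert data_parallel * tensor_parallel * pipeline_parallel == WORLD_SIZE

gpus_per_replica = tensor_parallel * pipeline_parallel
print(f"{data_parallel} replicas x {gpus_per_replica} GPUs per replica = {WORLD_SIZE} GPUs; "
      f"ZeRO additionally shards optimizer state across the data-parallel group.")
```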

Evaluation and Multitask Finetuning

Extensive evaluations showcase BLOOM's capabilities across a suite of benchmarks. Results on SuperGLUE, machine translation, summarization (WikiLingua), code generation (HumanEval), embeddings, and multilingual probing (Universal Probing) indicate competitive performance, particularly after multitask prompted finetuning (the BLOOMZ variant). The model's strengths and limitations are analyzed in detail, highlighting areas where BLOOM outperforms other models as well as scenarios where covering many languages at once introduces trade-offs.
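
For readers who want to reproduce the flavor of these evaluations, the sketch below runs zero-shot prompted generation with the small public bigscience/bloom-560m checkpoint; the prompt template is an illustrative assumption rather than one of the paper's PromptSource templates, and a 560M model will of course perform far below the 176B model.

```python
# Minimal sketch of zero-shot prompted generation with a small public BLOOM
# checkpoint. Prompt wording is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate to French: The cat sits on the mat.\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Decode only the newly generated continuation, not the prompt itself.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```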

Carbon Footprint and Ethical Implications

The carbon footprint of training BLOOM, estimated at around 81 tonnes of CO2, reflects a deliberate effort to limit environmental impact through efficient infrastructure and the choice of a low-carbon energy supply. The paper also discusses the broader social implications of LLMs, addressing risks such as language bias and proposing strategies for governing and ethically deploying BLOOM in research and application contexts.
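
As a hedged back-of-the-envelope sketch (every number below is a placeholder, not a figure from the paper), the calculation shows the general recipe behind such estimates: GPU energy, scaled by datacenter overhead and grid carbon intensity.

```python
# Back-of-the-envelope sketch of how training emissions are commonly estimated.
# All values are placeholders chosen for illustration only.
gpu_count = 384
training_hours = 3.5 * 30 * 24        # ~3.5 months, rough
avg_gpu_power_kw = 0.30               # placeholder average draw per GPU (kW)
pue = 1.2                             # placeholder datacenter power usage effectiveness
grid_kgco2_per_kwh = 0.06             # placeholder grid carbon intensity (kgCO2eq/kWh)

energy_kwh = gpu_count * training_hours * avg_gpu_power_kw * pue
emissions_tonnes = energy_kwh * grid_kgco2_per_kwh / 1000.0
print(f"~{energy_kwh:,.0f} kWh -> ~{emissions_tonnes:.1f} tonnes CO2eq (illustrative)")
```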

Future Directions and Conclusion

The development of BLOOM underlines the value of large-scale, inclusive collaborations in advancing AI research. The authors provide thorough documentation and open access to the model and its components to foster further innovation. Ongoing and future research can leverage BLOOM to explore multilingual NLP, fine-tune on specific tasks, and refine methods for mitigating bias and environmental impact.

In conclusion, BLOOM sets a precedent for collaborative and open-access AI research, balancing technical innovation with ethical and environmental stewardship. Its development, grounded in a detailed, transparent, and community-driven approach, paves the way for inclusive advancements in the field of NLP.

Authors (394)
  1. BigScience Workshop (1 paper)
  2. Teven Le Scao (18 papers)
  3. Angela Fan (49 papers)
  4. Christopher Akiki (15 papers)
  5. Ellie Pavlick (66 papers)
  6. Suzana Ilić (10 papers)
  7. Daniel Hesslow (12 papers)
  8. Roman Castagné (4 papers)
  9. Alexandra Sasha Luccioni (25 papers)
  10. François Yvon (49 papers)
  11. Matthias Gallé (31 papers)
  12. Jonathan Tow (7 papers)
  13. Alexander M. Rush (115 papers)
  14. Stella Biderman (55 papers)
  15. Albert Webson (19 papers)
  16. Pawan Sasanka Ammanamanchi (8 papers)
  17. Thomas Wang (17 papers)
  18. Benoît Sagot (60 papers)
  19. Niklas Muennighoff (56 papers)
Citations (2,096)