
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2211.05100v4)

Published 9 Nov 2022 in cs.CL

Abstract: LLMs have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access LLM designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer LLM that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

An Expert Overview of BLOOM: A Multilingual LLM

The paper "BLOOM: A 176B-Parameter Open-Access Multilingual LLM" presents a significant contribution to the field of NLP by documenting the development and evaluation of BLOOM. This model, a product of the BigScience collaborative effort, represents a monumental step in making large-scale LLMs accessible to the broader research community.

Overview and Motivation

BLOOM is a 176-billion-parameter multilingual LLM developed through the collaborative efforts of more than a thousand researchers from across the globe, coordinated under the BigScience initiative. Leveraging the compute resources provided by France’s Jean Zay supercomputer, BLOOM aims to democratize access to potentially transformative technologies that have typically been confined to well-resourced organizations.

Dataset and Tokenization

BLOOM's training dataset, ROOTS, is a curated collection of 498 datasets spanning 46 natural languages and 13 programming languages. The paper describes in depth the data governance and preprocessing strategies employed to ensure high-quality and diverse training data. The multilingual tokenizer, a byte-level BPE model and an important component in its own right, was designed to balance fertility (the average number of subword tokens produced per word) across languages, keeping tokenization efficient without systematically favoring some languages over others.
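
To make the fertility notion concrete, the short sketch below measures tokens per whitespace-delimited word with the publicly released bigscience/bloom-560m tokenizer; the sample sentences, and the use of whitespace splitting as a word proxy, are illustrative assumptions rather than the paper's evaluation protocol.

```python
# Minimal sketch of measuring tokenizer "fertility" (average subword tokens per
# whitespace-delimited word). Checkpoint and sample sentences are illustrative;
# languages written without spaces would need a proper word segmenter instead.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # small public BLOOM variant

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "French": "Le renard brun rapide saute par-dessus le chien paresseux.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
}

for lang, text in samples.items():
    fertility = len(tokenizer.tokenize(text)) / len(text.split())
    print(f"{lang}: {fertility:.2f} tokens per word")
```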

Model Architecture and Training

BLOOM's architecture is a causal decoder-only Transformer, which the authors found most suitable for zero-shot and few-shot generalization. Empirical ablations on smaller model variants guided architectural choices, including the adoption of ALiBi positional embeddings, which bias attention scores by key-query distance instead of adding positional embeddings to the input, and an additional layer normalization after the embedding layer. These design choices follow established practice but are tuned to balance training stability, performance, and multilingual competence.
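
As a rough illustration of the ALiBi mechanism (not BLOOM's training code), the sketch below builds the head-specific distance penalty that is added to the attention logits; the slope schedule assumes a power-of-two number of heads.

```python
# Minimal sketch of the ALiBi attention bias: no positional embeddings are added
# to the inputs; instead, each attention head adds a linear penalty proportional
# to key-query distance to its attention logits. Illustrative only.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... (valid for power-of-two head counts).
    base = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([base ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    rel = pos.view(1, -1) - pos.view(-1, 1)          # rel[i, j] = j - i (<= 0 for past keys)
    slopes = alibi_slopes(n_heads).view(-1, 1, 1)    # (heads, 1, 1)
    return slopes * rel                              # (heads, seq_len, seq_len)

# Usage: add to the raw attention scores before the causal mask and softmax, e.g.
# scores = q @ k.transpose(-2, -1) / d_head**0.5 + alibi_bias(n_heads, seq_len)
```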

Engineering and Training Infrastructure

The model was trained using the Megatron-DeepSpeed framework, which combines data, tensor, and pipeline parallelism with the ZeRO optimizer for efficient distributed training. Training ran for roughly 3.5 months on 384 NVIDIA A100 80GB GPUs of the Jean Zay supercomputer, sustaining high training throughput while keeping the environmental footprint comparatively modest; that footprint is documented in detail and compared with other large models.
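
The arithmetic below sketches how a 384-GPU job can be decomposed into data, tensor, and pipeline parallel groups; the particular degrees chosen are illustrative assumptions, not a claim about the exact layout used for BLOOM.

```python
# Illustrative arithmetic for a 3D-parallel layout over 384 GPUs (48 nodes x 8
# A100s). The specific parallel degrees below are assumptions for illustration.
WORLD_SIZE = 384

data_parallel = 8        # independent replicas processing different micro-batches
tensor_parallel = 4      # each layer's weight matrices sharded across 4 GPUs
pipeline_parallel = 12   # transformer layers split into 12 sequential stages

assert data_parallel * tensor_parallel * pipeline_parallel == WORLD_SIZE

gpus_per_replica = tensor_parallel * pipeline_parallel
print(f"{data_parallel} replicas x {gpus_per_replica} GPUs per replica = {WORLD_SIZE} GPUs; "
      f"ZeRO additionally shards optimizer state across the data-parallel group.")
```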

Evaluation and Multitask Finetuning

Extensive evaluations showcase BLOOM's capabilities across a suite of benchmarks. Results on SuperGLUE, machine translation, summarization (WikiLingua), code generation (HumanEval), embeddings, and multilingual probing (Universal Probing) indicate competitive performance, particularly after multitask prompted finetuning (the BLOOMZ variant). The model's strengths and limitations are analyzed in detail, highlighting areas where BLOOM outperforms other models as well as scenarios where covering many languages at once introduces trade-offs.
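
For readers who want to reproduce the flavor of these evaluations, the sketch below runs zero-shot prompted generation with the small public bigscience/bloom-560m checkpoint; the prompt template is an illustrative assumption rather than one of the paper's PromptSource templates, and a 560M model will of course perform far below the 176B model.

```python
# Minimal sketch of zero-shot prompted generation with a small public BLOOM
# checkpoint. Prompt wording is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate to French: The cat sits on the mat.\nTranslation:"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Decode only the newly generated continuation, not the prompt itself.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```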

Carbon Footprint and Ethical Implications

The carbon footprint of training BLOOM, estimated at around 81 tonnes of CO2, reflects a deliberate effort to limit environmental impact through efficient infrastructure and the choice of a low-carbon energy supply. The paper also discusses the broader social implications of LLMs, addressing risks such as language bias and proposing strategies for governing and ethically deploying BLOOM in research and application contexts.
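
As a hedged back-of-the-envelope sketch (every number below is a placeholder, not a figure from the paper), the calculation shows the general recipe behind such estimates: GPU energy, scaled by datacenter overhead and grid carbon intensity.

```python
# Back-of-the-envelope sketch of how training emissions are commonly estimated.
# All values are placeholders chosen for illustration only.
gpu_count = 384
training_hours = 3.5 * 30 * 24        # ~3.5 months, rough
avg_gpu_power_kw = 0.30               # placeholder average draw per GPU (kW)
pue = 1.2                             # placeholder datacenter power usage effectiveness
grid_kgco2_per_kwh = 0.06             # placeholder grid carbon intensity (kgCO2eq/kWh)

energy_kwh = gpu_count * training_hours * avg_gpu_power_kw * pue
emissions_tonnes = energy_kwh * grid_kgco2_per_kwh / 1000.0
print(f"~{energy_kwh:,.0f} kWh -> ~{emissions_tonnes:.1f} tonnes CO2eq (illustrative)")
```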

Future Directions and Conclusion

The development of BLOOM underlines the value of large-scale, inclusive collaborations in advancing AI research. The authors provide thorough documentation and open access to the model and its components to foster further innovation. Ongoing and future research can leverage BLOOM to explore multilingual NLP, fine-tune on specific tasks, and refine methods for mitigating bias and environmental impact.

In conclusion, BLOOM sets a precedent for collaborative and open-access AI research, balancing technical innovation with ethical and environmental stewardship. Its development, grounded in a detailed, transparent, and community-driven approach, paves the way for inclusive advancements in the field of NLP.

Authors (394)
  1. BigScience Workshop (1 paper)
  2. Teven Le Scao (18 papers)
  3. Angela Fan (49 papers)
  4. Christopher Akiki (15 papers)
  5. Ellie Pavlick (66 papers)
  6. Suzana Ilić (10 papers)
  7. Daniel Hesslow (12 papers)
  8. Roman Castagné (4 papers)
  9. Alexandra Sasha Luccioni (25 papers)
  10. François Yvon (49 papers)
  11. Matthias Gallé (31 papers)
  12. Jonathan Tow (7 papers)
  13. Alexander M. Rush (115 papers)
  14. Stella Biderman (55 papers)
  15. Albert Webson (19 papers)
  16. Pawan Sasanka Ammanamanchi (8 papers)
  17. Thomas Wang (17 papers)
  18. Benoît Sagot (60 papers)
  19. Niklas Muennighoff (56 papers)
Citations (2,096)