BLOOM (language model)
{{Short description|Open-access multilingual language model}}
{{Multiple issues|{{primary sources|date=October 2022}}{{promotional|date=October 2022}}}}
BigScience Large Open-science Open-access Multilingual Language Model (BLOOM){{cite web |title=BigScience Large Open-science Open-access Multilingual Language Model |url=https://huggingface.co/bigscience/bloom |access-date=2022-10-01}}{{Cite arXiv|title=BLOOM: A 176B-Parameter Open-Access Multilingual Language Model|eprint=2211.05100|vauthors=Le Scao T, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni A, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Sasanka Ammanamanchi P, Wang T, Sagot B, Muennighoff N, Villanova del Moral A, Ruwase O, Bawden R, Bekman S, McMillan-Major A, Beltagy I, Nguyen H, Saulnier L, Tan S, Ortiz Suarez P, Sanh V, Laurençon H, Jernite Y, Launay J, Mitchell M, Raffel C, etal|year=2022|class=cs.CL }} is a 176-billion-parameter transformer-based autoregressive large language model (LLM). The model, as well as the code base and the data used to train it, is distributed under free licences.{{cite web | title = The BigScience RAIL license| url=https://bigscience.huggingface.co/blog/the-bigscience-rail-license | access-date=2024-01-10}} BLOOM was trained on approximately 366 billion tokens (about 1.6 TB of text) from March to July 2022.{{cite web | last=Heikkilä | first=Melissa | title=BLOOM: Inside the radical new project to democratize AI | website=MIT Technology Review | date=2022-07-12 | url=https://www.technologyreview.com/2022/07/12/1055817/inside-a-radical-new-project-to-democratize-ai/ | access-date=2023-12-26}}{{cite web | title=Release of largest trained open-science multilingual language model ever | website=French National Centre for Scientific Research | date=2022-07-12 | url=https://www.cnrs.fr/en/press/release-largest-trained-open-science-multilingual-language-model-ever | access-date=2023-12-26}}
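Because the trained weights are openly released, the model can be run with standard open-source tooling. The following is a minimal sketch using the Hugging Face transformers library; the choice of the small bigscience/bloom-560m checkpoint (rather than the full 176-billion-parameter bigscience/bloom model, which requires hundreds of gigabytes of memory) is an assumption made here only so that the example is runnable on ordinary hardware.
<syntaxhighlight lang="python">
# Minimal sketch: autoregressive text generation with a BLOOM checkpoint
# via the Hugging Face "transformers" library. The small bloom-560m
# variant is used so the example fits in ordinary memory; the full
# 176B-parameter model is published as "bigscience/bloom".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "BLOOM is a multilingual language model that"
inputs = tokenizer(prompt, return_tensors="pt")

# Autoregressive decoding: each new token is predicted from the prompt
# plus all tokens generated so far.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
</syntaxhighlight>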
BLOOM is the main outcome of the BigScience collaborative initiative,{{cite web |title=BigScience |url=https://bigscience.huggingface.co |access-date=2024-01-10}} a one-year research workshop that took place between May 2021 and May 2022. BigScience was led by Hugging Face and involved several hundred researchers and engineers from France and abroad, drawn from both academia and the private sector. BigScience was supported by a large-scale public compute grant on the French public supercomputer Jean Zay, managed by GENCI and IDRIS (CNRS), on which BLOOM was trained.{{Citation needed|date=December 2024}}
BLOOM's training corpus, named ROOTS, combines data from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) with newly collected data from a manually selected and documented list of language data sources. It encompasses 46 natural languages (in amounts ranging from 30% of the whole dataset for English to 0.00002% for Chi Tumbuka) and 13 programming languages.{{Cite arXiv|title=The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset|eprint=2303.03915|vauthors=Laurençon H, Saulnier L, Wang T, Akiki C, Villanova del Moral A, Le Scao T, Von Werra L, Mou C, González Ponferrada C, Nguyen H, Frohberg J, Šaško M, Lhoest Q, McMillan-Major A, Dupont G, Biderman S, Rogers A, Ben allal L, De Toni F, Pistilli G, Nguyen O, Nikpoor S, Masoud M, Colombo P, de la Rosa J, Villegas P, Thrush T, Longpre S, Nagel S, Weber L, Muñoz M, Zhu J, Van Strien D, Alyafeai Z, Almubarak K, Vu MC, Gonzalez-Dios I, Soroa A, Lo K, Dey M, Ortiz Suarez P, Gokaslan A, Bose S, Adelani D, Phan L, Tran H, Yu I, Pai S, Chim J, Lepercq V, Ilic S, Mitchell M, Luccioni S, Jernite Y|year=2022|class=cs.CL }}
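To put these percentages in perspective, the following back-of-the-envelope sketch applies the shares quoted above to the corpus's stated size of roughly 1.6 TB. The resulting byte counts are rough illustrations derived only from the figures in this article, not measurements of the actual corpus.
<syntaxhighlight lang="python">
# Rough size estimates implied by the percentages quoted above,
# applied to the ~1.6 TB ROOTS corpus. Figures are illustrative only.
total_bytes = 1.6e12  # ~1.6 TB of text

shares = {
    "OSCAR-derived data": 0.38,    # 38% of ROOTS
    "English": 0.30,               # largest natural-language share
    "Chi Tumbuka": 0.00002 / 100,  # smallest share quoted (0.00002%)
}

for name, fraction in shares.items():
    gigabytes = fraction * total_bytes / 1e9
    print(f"{name}: ~{gigabytes:.5f} GB")
</syntaxhighlight>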
External links
- [https://bigscience.huggingface.co/ BigScience project on Hugging Face]
References
{{Reflist}}
{{Artificial intelligence navbox}}
Category:Large language models
Category:Generative pre-trained transformers
Category:Open-source artificial intelligence
{{compu-ling-stub}}