List of large language models

{{Short description|none}}

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

This page lists notable large language models.

== List ==

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64×10<sup>19</sup> FLOP. For model families released in multiple sizes, only the training cost of the largest model is listed.
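The figures in this column can be sanity-checked with a short calculation. The sketch below is illustrative only: the factor of roughly 6 FLOP per parameter per training token is a common rule of thumb for dense transformers (an assumption, not a value taken from the cited sources), and the function names and example numbers are chosen purely for demonstration.

<syntaxhighlight lang="python">
# Illustrative sketch: convert FLOP counts to petaFLOP-days and apply the
# common "~6 FLOP per parameter per token" rule of thumb for dense transformers.
PFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/s sustained for one day = 8.64e19 FLOP

def flop_to_pflop_days(flop: float) -> float:
    """Convert a raw FLOP count into petaFLOP-days."""
    return flop / PFLOP_DAY_IN_FLOP

def approx_training_flop(parameters: float, tokens: float) -> float:
    """Rough dense-transformer training compute estimate (assumed 6 * N * D)."""
    return 6 * parameters * tokens

# Example with GPT-3-like numbers: 175 billion parameters, 300 billion tokens.
flop = approx_training_flop(175e9, 300e9)   # ~3.15e23 FLOP
print(round(flop_to_pflop_days(flop)))      # ~3646 petaFLOP-days
</syntaxhighlight>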

{{table alignment}}

{{sort-under}}

class="wikitable sortable sort-under col2right col4right col5right col6right" style="font-size:smaller"
NameRelease date{{efn|This is the date that documentation describing the model's architecture was first released.}}DeveloperNumber of parameters (billion) {{efn|In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.}}Corpus size

!Training cost (petaFLOP-day)

License{{efn|This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.}}Notes
|-
|Attention Is All You Need

|{{dts|2017-06}}

|Vaswani et al. at Google

|0.213

|36 million English-French sentence pairs

|0.09{{Cite web |date=2022-06-09 |title=AI and compute |url=https://openai.com/index/ai-and-compute/ |access-date=2025-04-24 |website=openai.com |language=en-US}}

|

|Trained for 0.3M steps on 8 NVIDIA P100 GPUs.

GPT-1{{dts|2018-06}}OpenAI{{sort|0.117|0.117}}| 1{{cite web |date=June 11, 2018 |title=Improving language understanding with unsupervised learning |url=https://openai.com/research/language-unsupervised |url-status=live |archive-url=https://web.archive.org/web/20230318210736/https://openai.com/research/language-unsupervised |archive-date=2023-03-18 |access-date=2023-03-18 |website=openai.com }}{{yes|MIT}}{{cite web|work=GitHub|title=finetune-transformer-lm|url=https://github.com/openai/finetune-transformer-lm|access-date=2 January 2024|archive-date=19 May 2023|archive-url=https://web.archive.org/web/20230519062127/https://github.com/openai/finetune-transformer-lm|url-status=live}}

| First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.

BERT{{dts|2018-10}}Google{{sort|0.340|0.340}}{{cite arXiv |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |eprint=1810.04805v2|class=cs.CL }}{{sort|3300000000|3.3 billion}} words

|{{sort|9|9}}{{Cite web |last=Prickett |first=Nicole Hemsoth |date=2021-08-24 |title=Cerebras Shifts Architecture To Meet Massive AI/ML Models |url=https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ |access-date=2023-06-20 |website=The Next Platform |archive-date=2023-06-20 |archive-url=https://web.archive.org/web/20230620151619/https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ |url-status=live }}

{{yes|Apache 2.0}}{{Cite web|url=https://github.com/google-research/bert|title=BERT|date=March 13, 2023|via=GitHub|access-date=March 13, 2023|archive-date=January 13, 2021|archive-url=https://web.archive.org/web/20210113211317/https://github.com/google-research/bert|url-status=live}}

| An early and influential language model.{{cite journal |last=Manning |first=Christopher D. |author-link=Christopher D. Manning |year=2022 |title=Human Language Understanding & Reasoning |url=https://www.amacad.org/publication/human-language-understanding-reasoning |journal=Daedalus |volume=151 |issue=2 |pages=127–138 |doi=10.1162/daed_a_01905 |s2cid=248377870 |doi-access=free |access-date=2023-03-09 |archive-date=2023-11-17 |archive-url=https://web.archive.org/web/20231117205531/https://www.amacad.org/publication/human-language-understanding-reasoning |url-status=live }}Encoder-only and thus not built to be prompted or generative.{{cite arXiv |last1=Patel |first1=Ajay |last2=Li |first2=Bryan |last3=Rasooli |first3=Mohammad Sadegh |last4=Constant |first4=Noah |last5=Raffel |first5=Colin |last6=Callison-Burch |first6=Chris |title=Bidirectional Language Models Are Also Few-shot Learners |date=2022 |class=cs.LG |eprint=2209.14500}} Training took 4 days on 64 TPUv2 chips.{{cite arXiv |eprint=1810.04805v2 |class=cs.CL |first1=Jacob |last1=Devlin |first2=Ming-Wei |last2=Chang |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina}}

T5

|{{dts|2019-10}}

|Google

|{{sort|11|11}}{{Cite journal |last1=Raffel |first1=Colin |last2=Shazeer |first2=Noam |last3=Roberts |first3=Adam |last4=Lee |first4=Katherine |last5=Narang |first5=Sharan |last6=Matena |first6=Michael |last7=Zhou |first7=Yanqi |last8=Li |first8=Wei |last9=Liu |first9=Peter J. |date=2020 |title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |url=http://jmlr.org/papers/v21/20-074.html |journal=Journal of Machine Learning Research |volume=21 |issue=140 |pages=1–67 |arxiv=1910.10683 |issn=1533-7928}}

|34 billion tokens

|

| {{yes|Apache 2.0}}{{Citation |title=google-research/text-to-text-transfer-transformer |date=2024-04-02 |url=https://github.com/google-research/text-to-text-transfer-transformer |access-date=2024-04-04 |publisher=Google Research |archive-date=2024-03-29 |archive-url=https://web.archive.org/web/20240329112957/https://github.com/google-research/text-to-text-transfer-transformer |url-status=live }}

|Base model for many Google projects, such as Imagen.{{Cite web |title=Imagen: Text-to-Image Diffusion Models |url=https://imagen.research.google/ |access-date=2024-04-04 |website=imagen.research.google |archive-date=2024-03-27 |archive-url=https://web.archive.org/web/20240327201713/https://imagen.research.google/ |url-status=live }}

XLNet{{dts|2019-06}}Google{{sort|0.340|0.340}}{{Cite web |title=Pretrained models — transformers 2.0.0 documentation |url=https://huggingface.co/transformers/v2.0.0/pretrained_models.html |access-date=2024-08-05 |website=huggingface.co |archive-date=2024-08-05 |archive-url=https://web.archive.org/web/20240805032110/https://huggingface.co/transformers/v2.0.0/pretrained_models.html |url-status=live }}{{sort|3300000000|33}} billion words

| 330

{{yes|Apache 2.0}}{{cite web|work=GitHub|title=xlnet|url=https://github.com/zihangdai/xlnet/|access-date=2 January 2024|archive-date=2 January 2024|archive-url=https://web.archive.org/web/20240102191842/https://github.com/zihangdai/xlnet/|url-status=live}}

| An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.{{cite arXiv |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Ruslan |last6=Le |first6=Quoc V. |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |date=2 January 2020 |class=cs.CL |eprint=1906.08237}}

GPT-2{{dts|2019-02}}OpenAI{{sort|1.5|1.5}}{{Cite web |url = https://openai.com/blog/gpt-2-1-5b-release/ |title = GPT-2: 1.5B Release |date = 2019-11-05 |website = OpenAI |language = en |access-date = 2019-11-14 |archive-date = 2019-11-14 |archive-url = https://web.archive.org/web/20191114074358/https://openai.com/blog/gpt-2-1-5b-release/ |url-status = live}}40GB{{cite web |title=Better language models and their implications |url=https://openai.com/research/better-language-models |website=openai.com |access-date=2023-03-13 |archive-date=2023-03-16 |archive-url=https://web.archive.org/web/20230316160730/https://openai.com/research/better-language-models |url-status=live }} (~{{sort|10000000000|10 billion}} tokens){{cite web |title=OpenAI's GPT-3 Language Model: A Technical Overview |url=https://lambdalabs.com/blog/demystifying-gpt-3 |website=lambdalabs.com |date=3 June 2020 |access-date=13 March 2023 |archive-date=27 March 2023 |archive-url=https://web.archive.org/web/20230327213811/https://lambdalabs.com/blog/demystifying-gpt-3 |url-status=live }}

| 28{{Cite web |title=openai-community/gpt2-xl · Hugging Face |url=https://huggingface.co/openai-community/gpt2-xl |access-date=2024-07-24 |website=huggingface.co |archive-date=2024-07-24 |archive-url=https://web.archive.org/web/20240724041702/https://huggingface.co/openai-community/gpt2-xl |url-status=live }}

{{yes|MIT}}{{cite web|work=GitHub|title=gpt-2|url=https://github.com/openai/gpt-2|access-date=13 March 2023|archive-date=11 March 2023|archive-url=https://web.archive.org/web/20230311154936/https://github.com/openai/gpt-2|url-status=live}}

| Trained on 32 TPUv3 chips for 1 week.

GPT-3{{dts|2020-05}}OpenAI{{sort|175|175}}{{cite web |last=Wiggers |first=Kyle |date=28 April 2022 |title=The emerging types of language models and why they matter |url=https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |work=TechCrunch |access-date=9 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316072443/https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |url-status=live }}{{sort|300000000000|300 billion}} tokens

|3640Table D.1 in {{Cite arXiv |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=May 28, 2020 |title=Language Models are Few-Shot Learners |eprint=2005.14165v4 |first16=Aditya |last16=Ramesh |first17=Daniel M. |last17=Ziegler |first18=Jeffrey |last18=Wu |first19=Clemens |last19=Winter |first20=Christopher |last20=Hesse |first21=Mark |last21=Chen |first22=Eric |last22=Sigler |first23=Mateusz |last23=Litwin |first24=Scott |last24=Gray |first25=Benjamin |last25=Chess |first26=Jack |last26=Clark |first27=Christopher |last27=Berner |first28=Sam |last28=McCandlish |first29=Alec |last29=Radford |first30=Ilya |last30=Sutskever |first31=Dario |last31=Amodei|class=cs.CL}}

|{{proprietary}}

| A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.{{Cite web |date=2022-11-30 |title=ChatGPT: Optimizing Language Models for Dialogue |url=https://openai.com/blog/chatgpt/ |access-date=2023-01-13 |website=OpenAI |archive-date=2022-11-30 |archive-url=https://web.archive.org/web/20221130180912/https://openai.com/blog/chatgpt/ |url-status=live}}

GPT-Neo{{dts|2021-03}}EleutherAI{{sort|2.7|2.7}}{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023|via=GitHub|access-date=March 12, 2023|archive-date=March 12, 2023|archive-url=https://web.archive.org/web/20230312225202/https://github.com/EleutherAI/gpt-neo|url-status=live}}825 GiB{{cite arXiv |last1=Gao |first1=Leo |last2=Biderman |first2=Stella |last3=Black |first3=Sid |last4=Golding |first4=Laurence |last5=Hoppe |first5=Travis |last6=Foster |first6=Charles |last7=Phang |first7=Jason |last8=He |first8=Horace |last9=Thite |first9=Anish |last10=Nabeshima |first10=Noa |last11=Presser |first11=Shawn |last12=Leahy |first12=Connor |title=The Pile: An 800GB Dataset of Diverse Text for Language Modeling |eprint=2101.00027|date=31 December 2020 |class=cs.CL}}

|

|{{yes|MIT}}

| The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.{{cite web |work=VentureBeat |last=Iyer |first=Abhishek |title=GPT-3's free alternative GPT-Neo is something to be excited about |date=15 May 2021 |url=https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |access-date=13 March 2023 |archive-date=9 March 2023 |archive-url=https://web.archive.org/web/20230309012717/https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |url-status=live}}

GPT-J{{dts|2021-06}}EleutherAI{{sort|6|6}}{{Cite web |title=GPT-J-6B: An Introduction to the Largest Open Source GPT Model {{!}} Forefront |url=https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |access-date=2023-02-28 |website=www.forefront.ai |archive-date=2023-03-09 |archive-url=https://web.archive.org/web/20230309205439/https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |url-status=dead }}825 GiB

|200{{Cite arXiv |last1=Dey |first1=Nolan |last2=Gosal |first2=Gurpreet |last3=Zhiming |last4=Chen |last5=Khachane |first5=Hemant |last6=Marshall |first6=William |last7=Pathria |first7=Ribhu |last8=Tom |first8=Marvin |last9=Hestness |first9=Joel |date=2023-04-01 |title=Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster |class=cs.LG |eprint=2304.03208}}

|{{yes|Apache 2.0}}

| GPT-3-style language model

Megatron-Turing NLG{{dts|2021-10}} {{cite web |last1=Alvi |first1=Ali |last2=Kharya |first2=Paresh |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model |url=https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=Microsoft Research |date=11 October 2021 |access-date=13 March 2023 |archive-date=13 March 2023 |archive-url=https://web.archive.org/web/20230313180531/https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |url-status=live }}Microsoft and Nvidia{{sort|530|530}}{{Cite arXiv |last1=Smith |first1=Shaden |last2=Patwary |first2=Mostofa |last3=Norick |first3=Brandon |last4=LeGresley |first4=Patrick |last5=Rajbhandari |first5=Samyam |last6=Casper |first6=Jared |last7=Liu |first7=Zhun |last8=Prabhumoye |first8=Shrimai |last9=Zerveas |first9=George |last10=Korthikanti |first10=Vijay |last11=Zhang |first11=Elton |last12=Child |first12=Rewon |last13=Aminabadi |first13=Reza Yazdani |last14=Bernauer |first14=Julie |last15=Song |first15=Xia |date=2022-02-04 |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |class=cs.CL |eprint=2201.11990}}{{sort|338600000000|338.6 billion}} tokens

| 38000

|{{no|Restricted web access}}

| Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.{{Citation |last1=Rajbhandari |first1=Samyam |title=DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale |date=2022-07-21 |arxiv=2201.05596 |last2=Li |first2=Conglong |last3=Yao |first3=Zhewei |last4=Zhang |first4=Minjia |last5=Aminabadi |first5=Reza Yazdani |last6=Awan |first6=Ammar Ahmad |last7=Rasley |first7=Jeff |last8=He |first8=Yuxiong}}

Ernie 3.0 Titan{{dts|2021-12}}Baidu{{sort|260|260}}{{Cite arXiv|title=ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation|first1=Shuohuan|last1=Wang|first2=Yu|last2=Sun|first3=Yang|last3=Xiang|first4=Zhihua|last4=Wu|first5=Siyu|last5=Ding|first6=Weibao|last6=Gong|first7=Shikun|last7=Feng|first8=Junyuan|last8=Shang|first9=Yanbin|last9=Zhao|first10=Chao|last10=Pang|first11=Jiaxiang|last11=Liu|first12=Xuyi|last12=Chen|first13=Yuxiang|last13=Lu|first14=Weixin|last14=Liu|first15=Xi|last15=Wang|first16=Yangfan|last16=Bai|first17=Qiuliang|last17=Chen|first18=Li|last18=Zhao|first19=Shiyong|last19=Li|first20=Peng|last20=Sun|first21=Dianhai|last21=Yu|first22=Yanjun|last22=Ma|first23=Hao|last23=Tian|first24=Hua|last24=Wu|first25=Tian|last25=Wu|first26=Wei|last26=Zeng|first27=Ge|last27=Li|first28=Wen|last28=Gao|first29=Haifeng|last29=Wang|date=December 23, 2021|class=cs.CL |eprint=2112.12731}}4 Tb

|

|{{proprietary}}

| Chinese-language LLM. Ernie Bot is based on this model.

Claude{{cite web |title=Product |url=https://www.anthropic.com/product |website=Anthropic |access-date=14 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316145444/https://www.anthropic.com/product |url-status=live }}{{dts|2021-12}}Anthropic{{sort|52|52}}{{cite arXiv |last1=Askell |first1=Amanda |last2=Bai |first2=Yuntao |last3=Chen |first3=Anna |last4=Drain |first4=Dawn |last5=Ganguli |first5=Deep |last6=Henighan |first6=Tom |last7=Jones |first7=Andy |last8=Joseph |first8=Nicholas |last9=Mann |first9=Ben |last10=DasSarma |first10=Nova |last11=Elhage |first11=Nelson |last12=Hatfield-Dodds |first12=Zac |last13=Hernandez |first13=Danny |last14=Kernion |first14=Jackson |last15=Ndousse |first15=Kamal |last16=Olsson |first16=Catherine |last17=Amodei |first17=Dario |last18=Brown |first18=Tom |last19=Clark |first19=Jack |last20=McCandlish |first20=Sam |last21=Olah |first21=Chris |last22=Kaplan |first22=Jared |display-authors=3 |title=A General Language Assistant as a Laboratory for Alignment |eprint=2112.00861 |date=9 December 2021 |class=cs.CL}}{{sort|400000000000|400 billion}} tokens

|

|{{partial success|beta}}

| Fine-tuned for desirable behavior in conversations.{{cite arXiv |last1=Bai |first1=Yuntao |last2=Kadavath |first2=Saurav |last3=Kundu |first3=Sandipan |last4=Askell |first4=Amanda |last5=Kernion |first5=Jackson |last6=Jones |first6=Andy |last7=Chen |first7=Anna |last8=Goldie |first8=Anna |last9=Mirhoseini |first9=Azalia |last10=McKinnon |first10=Cameron |last11=Chen |first11=Carol |last12=Olsson |first12=Catherine |last13=Olah |first13=Christopher |last14=Hernandez |first14=Danny |last15=Drain |first15=Dawn |last16=Ganguli |first16=Deep |last17=Li |first17=Dustin |last18=Tran-Johnson |first18=Eli |last19=Perez |first19=Ethan |last20=Kerr |first20=Jamie |last21=Mueller |first21=Jared |last22=Ladish |first22=Jeffrey |last23=Landau |first23=Joshua |last24=Ndousse |first24=Kamal |last25=Lukosuite |first25=Kamile |last26=Lovitt |first26=Liane |last27=Sellitto |first27=Michael |last28=Elhage |first28=Nelson |last29=Schiefer |first29=Nicholas |last30=Mercado |first30=Noemi |last31=DasSarma |first31=Nova |last32=Lasenby |first32=Robert |last33=Larson |first33=Robin |last34=Ringer |first34=Sam |last35=Johnston |first35=Scott |last36=Kravec |first36=Shauna |last37=Showk |first37=Sheer El |last38=Fort |first38=Stanislav |last39=Lanham |first39=Tamera |last40=Telleen-Lawton |first40=Timothy |last41=Conerly |first41=Tom |last42=Henighan |first42=Tom |last43=Hume |first43=Tristan |last44=Bowman |first44=Samuel R. |last45=Hatfield-Dodds |first45=Zac |last46=Mann |first46=Ben |last47=Amodei |first47=Dario |last48=Joseph |first48=Nicholas |last49=McCandlish |first49=Sam |last50=Brown |first50=Tom |last51=Kaplan |first51=Jared |display-authors=3 |title=Constitutional AI: Harmlessness from AI Feedback |eprint=2212.08073 |date=15 December 2022 |class=cs.CL}}

GLaM (Generalist Language Model){{dts|2021-12}}Google{{sort|1200|1200}}{{sort|1600000000000|1.6 trillion}} tokens{{Cite web |last1=Dai |first1=Andrew M |last2=Du |first2=Nan |date=December 9, 2021 |title=More Efficient In-Context Learning with GLaM |url=https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |access-date=2023-03-09 |website=ai.googleblog.com |archive-date=2023-03-12 |archive-url=https://web.archive.org/web/20230312072042/https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |url-status=live}}

| 5600

|{{proprietary}}

| Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.

Gopher{{dts|2021-12}}DeepMind{{sort|280|280}}{{cite web |title=Language modelling at scale: Gopher, ethical considerations, and retrieval |url=https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval |website=www.deepmind.com |date=8 December 2021 |access-date=20 March 2023 |archive-date=20 March 2023 |archive-url=https://web.archive.org/web/20230320082323/https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval |url-status=live }}{{sort|300000000000|300 billion}} tokens

|5833Table 20 and page 66 of [https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf PaLM: Scaling Language Modeling with Pathways] {{Webarchive|url=https://web.archive.org/web/20230610040050/https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf |date=2023-06-10 }}

|{{proprietary}}

| Later developed into the Chinchilla model.

LaMDA (Language Models for Dialog Applications){{dts|2022-01}}Google{{sort|137|137}}1.56T words,{{Cite web |last1=Cheng |first1=Heng-Tze |last2=Thoppilan |first2=Romal |date=January 21, 2022 |title=LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything |url=https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html |access-date=2023-03-09 |website=ai.googleblog.com |archive-date=2022-03-25 |archive-url=https://web.archive.org/web/20220325014118/https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html |url-status=live}} {{sort|168000000000|168 billion}} tokens

|4110{{Cite arXiv |last1=Thoppilan |first1=Romal |last2=De Freitas |first2=Daniel |last3=Hall |first3=Jamie |last4=Shazeer |first4=Noam |last5=Kulshreshtha |first5=Apoorv |last6=Cheng |first6=Heng-Tze |last7=Jin |first7=Alicia |last8=Bos |first8=Taylor |last9=Baker |first9=Leslie |last10=Du |first10=Yu |last11=Li |first11=YaGuang |last12=Lee |first12=Hongrae |last13=Zheng |first13=Huaixiu Steven |last14=Ghafouri |first14=Amin |last15=Menegali |first15=Marcelo |date=2022-01-01 |title=LaMDA: Language Models for Dialog Applications |class=cs.CL |eprint=2201.08239}}

|{{proprietary}}

| Specialized for response generation in conversations.

GPT-NeoX{{dts|2022-02}}EleutherAI{{sort|20|20}}{{cite conference |title=GPT-NeoX-20B: An Open-Source Autoregressive Language Model |conference=Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models |date=2022-05-01 |last1=Black |first1=Sidney |last2=Biderman |first2=Stella |last3=Hallahan |first3=Eric |display-authors=etal |volume=Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models |pages=95–136 |url=https://aclanthology.org/2022.bigscience-1.9/ |access-date=2022-12-19 |archive-date=2022-12-10 |archive-url=https://web.archive.org/web/20221210082456/https://aclanthology.org/2022.bigscience-1.9/ |url-status=live }}825 GiB

|740

|{{yes|Apache 2.0}}

| Based on the Megatron architecture.

Chinchilla{{dts|2022-03}}DeepMind{{sort|70|70}}{{sort|1400000000000|1.4 trillion}} tokens{{cite web |work=Deepmind Blog |title=An empirical analysis of compute-optimal large language model training |first1=Jordan |last1=Hoffmann |first2=Sebastian |last2=Borgeaud |first3=Arthur |last3=Mensch |first4=Laurent |last4=Sifre |date=12 April 2022 |url=https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training |access-date=9 March 2023 |archive-date=13 April 2022 |archive-url=https://web.archive.org/web/20220413014510/https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training |url-status=live}}{{cite arXiv |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |date=29 March 2022 |class=cs.CL |display-authors=3}}

|6805

|{{proprietary}}

| Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.

PaLM (Pathways Language Model){{dts|2022-04}}Google{{sort|540|540}}{{Cite web |last1=Narang |first1=Sharan |last2=Chowdhery |first2=Aakanksha |date=April 4, 2022 |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |access-date=2023-03-09 |website=ai.googleblog.com |language=en |archive-date=2022-04-04 |archive-url=https://web.archive.org/web/20220404161447/https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |url-status=live}}{{sort|768000000000|768 billion}} tokens

|{{sort|29250|29,250}}

|{{proprietary}}

| Trained for ~60 days on ~6000 TPU v4 chips. {{As of|2024|October}}, it is the largest dense Transformer published.

OPT (Open Pretrained Transformer){{dts|2022-05}}Meta{{sort|175|175}}{{cite web |title=Democratizing access to large-scale language models with OPT-175B |author1=Susan Zhang |author2=Mona Diab |author3=Luke Zettlemoyer |url=https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.facebook.com |access-date=2023-03-12 |archive-date=2023-03-12 |archive-url=https://web.archive.org/web/20230312231820/https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |url-status=live }}{{sort|180000000000|180 billion}} tokens{{cite arXiv |last1=Zhang |first1=Susan |last2=Roller |first2=Stephen |last3=Goyal |first3=Naman |last4=Artetxe |first4=Mikel |last5=Chen |first5=Moya |last6=Chen |first6=Shuohui |last7=Dewan |first7=Christopher |last8=Diab |first8=Mona |last9=Li |first9=Xian |last10=Lin |first10=Xi Victoria |last11=Mihaylov |first11=Todor |last12=Ott |first12=Myle |last13=Shleifer |first13=Sam |last14=Shuster |first14=Kurt |last15=Simig |first15=Daniel |last16=Koura |first16=Punit Singh |last17=Sridhar |first17=Anjali |last18=Wang |first18=Tianlu |last19=Zettlemoyer |first19=Luke |title=OPT: Open Pre-trained Transformer Language Models |eprint=2205.01068 |date=21 June 2022|class=cs.CL}}

|310

|{{partial success|Non-commercial research}}{{efn|The smaller models including 66B are publicly available, while the 175B model is available on request.}}

| GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.{{Cite web |title=metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq |url=https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles |access-date=2024-10-18 |website=GitHub |language=en}}

YaLM 100B{{dts|2022-06}}Yandex{{sort|100|100}}{{Citation |last1=Khrushchev |first1=Mikhail |title=YaLM 100B |date=2022-06-22 |url=https://github.com/yandex/YaLM-100B |access-date=2023-03-18 |last2=Vasilev |first2=Ruslan |last3=Petrov |first3=Alexey |last4=Zinov |first4=Nikolay |archive-date=2023-06-16 |archive-url=https://web.archive.org/web/20230616050056/https://github.com/yandex/YaLM-100B |url-status=live }}1.7TB|{{Yes|Apache 2.0}}English-Russian model based on Microsoft's Megatron-LM.
Minerva{{dts|2022-06}}Google{{sort|540|540}}38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server{{cite arXiv |last1=Lewkowycz |first1=Aitor |last2=Andreassen |first2=Anders |last3=Dohan |first3=David |last4=Dyer |first4=Ethan |last5=Michalewski |first5=Henryk |last6=Ramasesh |first6=Vinay |last7=Slone |first7=Ambrose |last8=Anil |first8=Cem |last9=Schlag |first9=Imanol |last10=Gutman-Solo |first10=Theo |last11=Wu |first11=Yuhuai |last12=Neyshabur |first12=Behnam |last13=Gur-Ari |first13=Guy |last14=Misra |first14=Vedant |title=Solving Quantitative Reasoning Problems with Language Models |date=30 June 2022 |class=cs.CL |eprint=2206.14858}}

|

|{{proprietary}}

| For solving "mathematical and scientific questions using step-by-step reasoning".{{cite web |title=Minerva: Solving Quantitative Reasoning Problems with Language Models |url=https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html |website=ai.googleblog.com |date=30 June 2022 |access-date=20 March 2023 }} Initialized from PaLM models, then finetuned on mathematical and scientific data.

BLOOM{{dts|2022-07}}Large collaboration led by Hugging Face{{sort|175|175}}{{cite journal |journal=Nature |last=Ananthaswamy |first=Anil |title=In AI, is bigger always better? |date=8 March 2023 |volume=615 |issue=7951 |pages=202–205 |doi=10.1038/d41586-023-00641-w |pmid=36890378 |bibcode=2023Natur.615..202A |s2cid=257380916 |url=https://www.nature.com/articles/d41586-023-00641-w |access-date=9 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316181013/https://www.nature.com/articles/d41586-023-00641-w |url-status=live }}{{sort|350000000000|350 billion}} tokens (1.6TB){{cite web |title=bigscience/bloom · Hugging Face |url=https://huggingface.co/bigscience/bloom |website=huggingface.co |access-date=2023-03-13 |archive-date=2023-04-12 |archive-url=https://web.archive.org/web/20230412002547/https://huggingface.co/bigscience/bloom |url-status=live }}

|

|{{partial success|Responsible AI}}

| Essentially GPT-3, but trained on a multilingual corpus (30% English, excluding programming languages).

Galactica{{dts|2022-11}}Meta{{sort|120|120}}{{sort|350000000000|106 billion}} tokens{{cite arXiv |last1=Taylor |first1=Ross |last2=Kardas |first2=Marcin |last3=Cucurull |first3=Guillem |last4=Scialom |first4=Thomas |last5=Hartshorn |first5=Anthony |last6=Saravia |first6=Elvis |last7=Poulton |first7=Andrew |last8=Kerkez |first8=Viktor |last9=Stojnic |first9=Robert |title=Galactica: A Large Language Model for Science |date=16 November 2022 |class=cs.CL |eprint=2211.09085}}

| {{Unknown}}

|{{partial success|CC-BY-NC-4.0}}

| Trained on scientific text and modalities.

AlexaTM (Teacher Models){{dts|2022-11}}Amazon{{sort|20|20}}{{cite web |title=20B-parameter Alexa model sets new marks in few-shot learning |url=https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |website=Amazon Science |date=2 August 2022 |access-date=12 March 2023 |archive-date=15 March 2023 |archive-url=https://web.archive.org/web/20230315190223/https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |url-status=live }}{{sort|1300000000000|1.3 trillion}}{{cite arXiv |last1=Soltan |first1=Saleh |last2=Ananthakrishnan |first2=Shankar |last3=FitzGerald |first3=Jack |last4=Gupta |first4=Rahul |last5=Hamza |first5=Wael |last6=Khan |first6=Haidar |last7=Peris |first7=Charith |last8=Rawls |first8=Stephen |last9=Rosenbaum |first9=Andy |last10=Rumshisky |first10=Anna |last11=Prakash |first11=Chandana Satya |last12=Sridhar |first12=Mukund |last13=Triefenbach |first13=Fabian |last14=Verma |first14=Apurv |last15=Tur |first15=Gokhan |last16=Natarajan |first16=Prem |display-authors=3|title=AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |eprint=2208.01448 |date=3 August 2022|class=cs.CL}}

|

{{proprietary}}{{cite web |title=AlexaTM 20B is now available in Amazon SageMaker JumpStart {{!}} AWS Machine Learning Blog |url=https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |website=aws.amazon.com |access-date=13 March 2023 |date=17 November 2022 |archive-date=13 March 2023 |archive-url=https://web.archive.org/web/20230313163933/https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |url-status=live }}

| Bidirectional sequence-to-sequence architecture.

LLaMA (Large Language Model Meta AI){{dts|2023-02}}Meta AI{{sort|65|65}}{{sort|1400000000000|1.4 trillion}}{{cite web |work=Meta AI |title=Introducing LLaMA: A foundational, 65-billion-parameter large language model |date=24 February 2023 |url=https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ |access-date=9 March 2023 |archive-date=3 March 2023 |archive-url=https://web.archive.org/web/20230303112302/https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ |url-status=live}}

|6300{{Cite web |title=The Falcon has landed in the Hugging Face ecosystem |url=https://huggingface.co/blog/falcon |access-date=2023-06-20 |website=huggingface.co |archive-date=2023-06-20 |archive-url=https://web.archive.org/web/20230620002832/https://huggingface.co/blog/falcon |url-status=live }}

|{{partial success|Non-commercial research}}{{efn|Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.}}

| Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.

GPT-4{{dts|2023-03}}OpenAI{{Unknown}}{{efn|As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."{{Cite web |date=2023 |title=GPT-4 Technical Report |url=https://cdn.openai.com/papers/gpt-4.pdf |website=OpenAI |access-date=March 14, 2023 |archive-date=March 14, 2023 |archive-url=https://web.archive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf |url-status=live}} }}
(According to rumors: 1760){{Cite web |last=Schreiner |first=Maximilian |date=2023-07-11 |title=GPT-4 architecture, datasets, costs and more leaked |url=https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ |access-date=2024-07-26 |website=THE DECODER |language=en-US |archive-date=2023-07-12 |archive-url=https://web.archive.org/web/20230712123915/https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ |url-status=live }}

| {{Unknown}}

| {{Unknown}},
estimated 230,000.

|{{proprietary}}

| Available for ChatGPT Plus users and used in several products.

Chameleon{{dts|2024-06}}Meta AI{{sort|34|34}}{{cite news |last1=Dickson |first1=Ben |title=Meta introduces Chameleon, a state-of-the-art multimodal model |url=https://venturebeat.com/ai/meta-introduces-chameleon-a-state-of-the-art-multimodal-model/ |work=VentureBeat |date=22 May 2024}}{{sort|4400000000000|4.4 trillion}}
|-
|Cerebras-GPT

|{{dts|2023-03}}

|Cerebras

|{{sort|13|13}}{{Cite web|url=https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/|title=Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models|first=Nolan|last=Dey|date=March 28, 2023|website=Cerebras|access-date=March 28, 2023|archive-date=March 28, 2023|archive-url=https://web.archive.org/web/20230328213339/https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/|url-status=live}}

|

|270

|{{yes|Apache 2.0}}

| Trained with Chinchilla formula.

Falcon{{dts|2023-03}}Technology Innovation Institute{{sort|40|40}}{{cite web |title=Abu Dhabi-based TII launches its own version of ChatGPT |url=https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ |website=tii.ae |access-date=2023-04-03 |archive-date=2023-04-03 |archive-url=https://web.archive.org/web/20230403021729/https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ |url-status=live }}1 trillion tokens, from RefinedWeb (filtered web text corpus){{Cite arXiv |last1=Penedo |first1=Guilherme |last2=Malartic |first2=Quentin |last3=Hesslow |first3=Daniel |last4=Cojocaru |first4=Ruxandra |last5=Cappelli |first5=Alessandro |last6=Alobeidli |first6=Hamza |last7=Pannier |first7=Baptiste |last8=Almazrouei |first8=Ebtesam |last9=Launay |first9=Julien |date=2023-06-01 |title=The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only |class=cs.CL |eprint=2306.01116}} plus some "curated corpora".{{Cite web |date=2023-06-09 |title=tiiuae/falcon-40b · Hugging Face |url=https://huggingface.co/tiiuae/falcon-40b |access-date=2023-06-20 |website=huggingface.co}}

|2800

{{yes|Apache 2.0}}[https://www.businesswire.com/news/home/20230531005608/en/UAE's-Falcon-40B-World's-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free] {{Webarchive|url=https://web.archive.org/web/20240208133040/https://www.businesswire.com/news/home/20230531005608/en/UAE%27s-Falcon-40B-World%27s-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free |date=2024-02-08 }}, 31 May 2023

|

BloombergGPT{{dts|2023-03}}Bloomberg L.P.{{sort|50|50}}363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets{{Cite arXiv|title=BloombergGPT: A Large Language Model for Finance|first1=Shijie|last1=Wu|first2=Ozan|last2=Irsoy|first3=Steven|last3=Lu|first4=Vadim|last4=Dabravolski|first5=Mark|last5=Dredze|first6=Sebastian|last6=Gehrmann|first7=Prabhanjan|last7=Kambadur|first8=David|last8=Rosenberg|first9=Gideon|last9=Mann|date=March 30, 2023|class=cs.LG |eprint=2303.17564}}

|

|{{proprietary}}

| Trained on financial data from proprietary sources, for financial tasks.

PanGu-Σ{{dts|2023-03}}Huawei{{sort|1085|1085}}329 billion tokens{{Cite arXiv|title=PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing|first1=Xiaozhe|last1=Ren|first2=Pingyi|last2=Zhou|first3=Xinfan|last3=Meng|first4=Xinjing|last4=Huang|first5=Yadao|last5=Wang|first6=Weichao|last6=Wang|first7=Pengfei|last7=Li|first8=Xiaoda|last8=Zhang|first9=Alexander|last9=Podolskiy|first10=Grigory|last10=Arshinov|first11=Andrey|last11=Bout|first12=Irina|last12=Piontkovskaya|first13=Jiansheng|last13=Wei|first14=Xin|last14=Jiang|first15=Teng|last15=Su|first16=Qun|last16=Liu|first17=Jun|last17=Yao|date=March 19, 2023|class=cs.CL |eprint=2303.10845}}

|

|{{proprietary}}

|

OpenAssistant{{Cite arXiv |last1=Köpf |first1=Andreas |last2=Kilcher |first2=Yannic |last3=von Rütte |first3=Dimitri |last4=Anagnostidis |first4=Sotiris |last5=Tam |first5=Zhi-Rui |last6=Stevens |first6=Keith |last7=Barhoum |first7=Abdullah |last8=Duc |first8=Nguyen Minh |last9=Stanley |first9=Oliver |last10=Nagyfi |first10=Richárd |last11=ES |first11=Shahul |last12=Suri |first12=Sameer |last13=Glushkov |first13=David |last14=Dantuluri |first14=Arnav |last15=Maguire |first15=Andrew |date=2023-04-14 |title=OpenAssistant Conversations – Democratizing Large Language Model Alignment |class=cs.CL |eprint=2304.07327}}{{dts|2023-03}}LAION{{sort|17|17}}1.5 trillion tokens

|

|{{yes|Apache 2.0}}

| Trained on crowdsourced open data.

Jurassic-2{{Cite web |last=Wrobel |first=Sharon |title=Tel Aviv startup rolls out new advanced AI language model to rival OpenAI |url=https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ |access-date=2023-07-24 |website=www.timesofisrael.com |archive-date=2023-07-24 |archive-url=https://web.archive.org/web/20230724191823/https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ |url-status=live }}

|{{dts|2023-03}}

|AI21 Labs

| {{Unknown}}

| {{Unknown}}

|

|{{proprietary}}

|Multilingual{{Cite web |last=Wiggers |first=Kyle |date=2023-04-13 |title=With Bedrock, Amazon enters the generative AI race |url=https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ |access-date=2023-07-24 |website=TechCrunch |archive-date=2023-07-24 |archive-url=https://web.archive.org/web/20230724102458/https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ |url-status=live }}

PaLM 2 (Pathways Language Model 2){{dts|2023-05}}Google{{sort|340|340}}{{cite web |last=Elias |first=Jennifer |url=https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html |title=Google's newest A.I. model uses nearly five times more text data for training than its predecessor |work=CNBC |date=16 May 2023 |access-date=18 May 2023 |archive-date=16 May 2023 |archive-url=https://web.archive.org/web/20230516225326/https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html |url-status=live }}{{sort|3600000000000|3.6 trillion}} tokens

|{{sort|85000|85,000}}

|{{proprietary}}

| Was used in Bard chatbot.{{Cite web|url=https://blog.google/technology/ai/google-palm-2-ai-large-language-model/|title=Introducing PaLM 2|date=May 10, 2023|website=Google|access-date=May 18, 2023|archive-date=May 18, 2023|archive-url=https://web.archive.org/web/20230518213209/https://blog.google/technology/ai/google-palm-2-ai-large-language-model/|url-status=live}}

Llama 2{{dts|2023-07}}Meta AI{{sort|70|70}}{{Cite web | url = https://ai.meta.com/llama/ | title = Introducing Llama 2: The Next Generation of Our Open Source Large Language Model | access-date = 2023-07-19 | website = Meta AI | date = 2023 | archive-date = 2024-01-05 | archive-url = https://web.archive.org/web/20240105234629/https://ai.meta.com/llama/ | url-status = live }}{{sort|2000000000000|2 trillion}} tokens

| {{sort|21000|21,000}}

|{{partial success|Llama 2 license}}

| 1.7 million A100-hours.{{Cite web |title=llama/MODEL_CARD.md at main · meta-llama/llama |url=https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md |access-date=2024-05-28 |website=GitHub |archive-date=2024-05-28 |archive-url=https://web.archive.org/web/20240528090541/https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md |url-status=live }}

|-
|Claude 2

|{{dts|2023-07}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in Claude chatbot.{{cite web |title=Claude 2 |url=https://www.anthropic.com/index/claude-2 |website=anthropic.com |access-date=12 December 2023 |archive-date=15 December 2023 |archive-url=https://web.archive.org/web/20231215212208/https://www.anthropic.com/index/claude-2 |url-status=live }}

|-
|Granite 13b

|{{dts|2023-07}}

|IBM

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in IBM Watsonx.{{Cite web |last=Nirmal |first=Dinesh |date=2023-09-07 |title=Building AI for business: IBM's Granite foundation models |url=https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models |access-date=2024-08-11 |website=IBM Blog |language=en-US |archive-date=2024-07-22 |archive-url=https://web.archive.org/web/20240722083855/https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models/ |url-status=live }}

Mistral 7B{{dts|2023-09}}Mistral AI{{sort|7.3|7.3}}{{Cite web | url = https://mistral.ai/news/announcing-mistral-7b/ | title = Announcing Mistral 7B | access-date = 2023-10-06 | website = Mistral | date = 2023 | archive-date = 2024-01-06 | archive-url = https://web.archive.org/web/20240106051047/https://mistral.ai/news/announcing-mistral-7b/ | url-status = live }}{{Unknown}}

|

|{{yes|Apache 2.0}}

|

|-
|Claude 2.1

|{{dts|2023-11}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.{{cite web |title=Introducing Claude 2.1 |url=https://www.anthropic.com/index/claude-2-1 |website=anthropic.com |access-date=12 December 2023 |archive-date=15 December 2023 |archive-url=https://web.archive.org/web/20231215201726/https://www.anthropic.com/index/claude-2-1 |url-status=live }}

Grok 1{{Citation |title=xai-org/grok-1 |date=2024-03-19 |url=https://github.com/xai-org/grok-1 |access-date=2024-03-19 |publisher=xai-org |archive-date=2024-05-28 |archive-url=https://web.archive.org/web/20240528170731/https://github.com/xai-org/grok-1 |url-status=live }}

|{{dts|2023-11}}

|xAI

|314

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| Used in Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).{{cite web |title=Grok-1 model card |url=https://x.ai/model-card/ |website=x.ai |access-date=12 December 2023}}

|-
|Gemini 1.0

|{{dts|2023-12}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Multimodal model, comes in three sizes. Used in the chatbot of the same name.{{cite web |title=Gemini – Google DeepMind |url=https://deepmind.google/technologies/gemini/#capabilities |website=deepmind.google |access-date=12 December 2023 |archive-date=8 December 2023 |archive-url=https://web.archive.org/web/20231208015607/https://deepmind.google/technologies/gemini/#capabilities |url-status=live }}

|-
|Mixtral 8x7B

|{{dts|2023-12}}

|Mistral AI

|46.7

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.{{cite web |last1=Franzen |first1=Carl |title=Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance |url=https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ |website=VentureBeat |access-date=12 December 2023 |date=11 December 2023 |archive-date=11 December 2023 |archive-url=https://web.archive.org/web/20231211213640/https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ |url-status=live }} Mixture of experts model, with 12.9 billion parameters activated per token.{{cite web |date=11 December 2023 |title=Mixtral of experts |url=https://mistral.ai/news/mixtral-of-experts/ |access-date=12 December 2023 |website=mistral.ai |archive-date=13 February 2024 |archive-url=https://web.archive.org/web/20240213104049/https://mistral.ai/news/mixtral-of-experts/ |url-status=live }}

|-
|Mixtral 8x22B

|{{dts|2024-04}}

|Mistral AI

|141

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| {{Cite web |last=AI |first=Mistral |date=2024-04-17 |title=Cheaper, Better, Faster, Stronger |url=https://mistral.ai/news/mixtral-8x22b/ |access-date=2024-05-05 |website=mistral.ai |archive-date=2024-05-05 |archive-url=https://web.archive.org/web/20240505023828/https://mistral.ai/news/mixtral-8x22b/ |url-status=live }}

|-
|DeepSeek-LLM

|{{DTS|2023-11-29}}

|DeepSeek

|67

|2T tokens{{Citation |last1=DeepSeek-AI |title=DeepSeek LLM: Scaling Open-Source Language Models with Longtermism |date=2024-01-05 |arxiv=2401.02954 |last2=Bi |first2=Xiao |last3=Chen |first3=Deli |last4=Chen |first4=Guanting |last5=Chen |first5=Shanhuang |last6=Dai |first6=Damai |last7=Deng |first7=Chengqi |last8=Ding |first8=Honghui |last9=Dong |first9=Kai}}{{Pg|location=table 2}}

|{{sort|12000|12,000}}

|{{partial success|DeepSeek License}}

|Trained on English and Chinese text. 1e24 FLOPs for 67B. 1e23 FLOPs for 7B{{Pg|location=figure 5}}

|-
|Phi-2

|{{dts|2023-12}}

|Microsoft

|2.7

|1.4T tokens

|419

|{{yes|MIT}}

| Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.{{cite web |last1=Hughes |first1=Alyssa |title=Phi-2: The surprising power of small language models |url=https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ |website=Microsoft Research |access-date=13 December 2023 |date=12 December 2023 |archive-date=12 December 2023 |archive-url=https://web.archive.org/web/20231212232647/https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ |url-status=live }}

|-
|Gemini 1.5

|{{dts|2024-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.{{cite web |title=Our next-generation model: Gemini 1.5 |url=https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window |website=Google |access-date=16 February 2024 |date=15 February 2024 |quote=This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens. |archive-date=16 February 2024 |archive-url=https://web.archive.org/web/20240216003052/https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window |url-status=live }}

|-
|Gemini Ultra

|{{dts|2024-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|
|-
|Gemma
|{{dts|2024-02}}
|Google DeepMind
|7
|6T tokens
| {{Unknown}}
|{{partial success|Gemma Terms of Use}}{{cite web|url=https://ai.google.dev/gemma/terms|title=Gemma|via=GitHub}}
|
|-
|Claude 3

|{{dts|2024-03}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Includes three models, Haiku, Sonnet, and Opus.{{Cite web |title=Introducing the next generation of Claude |url=https://www.anthropic.com/news/claude-3-family |access-date=2024-03-04 |website=www.anthropic.com |archive-date=2024-03-04 |archive-url=https://web.archive.org/web/20240304143650/https://www.anthropic.com/news/claude-3-family |url-status=live }}

|-
|[https://rubiks.ai/nova/release/ Nova]

|{{dts|2024-10}}

|[https://rubiks.ai/ Rubik's AI]

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Comprised three models: Nova-Instant, Nova-Air, and Nova-Pro. The company later shifted to Sonus AI.

|-
|[https://sonus.ai/blog/sonus-1 Sonus]{{Cite web |title=Sonus AI |url=https://sonus.ai/ |access-date=2025-03-07 |website=sonus.ai |language=en-US}}

|{{dts|2025-01}}

|[https://rubiks.ai/ Rubik's AI]

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|

|-
|DBRX

|{{dts|2024-03}}

|Databricks and Mosaic ML

|{{sort|136|136}}

|12T Tokens

|

|{{Yes|Databricks Open Model License}}

|Training cost 10 million USD.

|-
|Fugaku-LLM

|{{dts|2024-05}}

|Fujitsu, Tokyo Institute of Technology, etc.

|{{sort|13|13}}

|380B Tokens

|

|

|The largest model trained using only CPUs, on the supercomputer Fugaku.{{Cite web |title=Fugaku-LLM/Fugaku-LLM-13B · Hugging Face |url=https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B |access-date=2024-05-17 |website=huggingface.co |archive-date=2024-05-17 |archive-url=https://web.archive.org/web/20240517135225/https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B |url-status=live }}

|-
|Phi-3

|{{dts|2024-04}}

|Microsoft

|14{{cite web|title=Phi-3|url=https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms|access-date=2024-04-28|website=azure.microsoft.com|date=23 April 2024|archive-date=2024-04-27|archive-url=https://web.archive.org/web/20240427043835/https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/|url-status=live}}

|4.8T Tokens

|

|{{Yes|MIT}}

|Microsoft markets them as "small language models".{{cite web|title=Phi-3 Model Documentation|url=https://huggingface.co/docs/transformers/main/en/model_doc/phi3|access-date=2024-04-28|website=huggingface.co|archive-date=2024-05-13|archive-url=https://web.archive.org/web/20240513141513/https://huggingface.co/docs/transformers/main/en/model_doc/phi3|url-status=live}}

|-
|Granite Code Models

|{{dts|2024-05}}

|IBM

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

|

|-
|Qwen2

|{{dts|2024-06}}

|Alibaba Cloud

|72{{cite web|title= Qwen2|website= GitHub|url= https://github.com/QwenLM/Qwen2?spm=a3c0i.28768018.7084722650.1.5cd35c10NEqBXm&file=Qwen1.5|access-date= 2024-06-17|archive-date= 2024-06-17|archive-url= https://web.archive.org/web/20240617072401/https://github.com/QwenLM/Qwen2?spm=a3c0i.28768018.7084722650.1.5cd35c10NEqBXm&file=Qwen1.5|url-status= live}}

|3T Tokens

| {{Unknown}}

|{{partial success|Qwen License}}

|Multiple sizes, the smallest being 0.5B.

|-
|DeepSeek-V2

|{{DTS|2024-06}}

|DeepSeek

|236

|8.1T tokens

|{{sort|28000|28,000}}

|{{partial success|DeepSeek License}}

|1.4M hours on H800.{{Citation |last1=DeepSeek-AI |title=DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |date=2024-06-19 |arxiv=2405.04434 |last2=Liu |first2=Aixin |last3=Feng |first3=Bei |last4=Wang |first4=Bin |last5=Wang |first5=Bingxuan |last6=Liu |first6=Bo |last7=Zhao |first7=Chenggang |last8=Dengr |first8=Chengqi |last9=Ruan |first9=Chong}}

|-
|Nemotron-4

|{{dts|2024-06}}

|Nvidia

|{{sort|340|340}}

|9T Tokens

| {{sort|200000|200,000}}

|{{Yes|NVIDIA Open Model License}}

|Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.{{Cite web |date=2024-06-14 |title=nvidia/Nemotron-4-340B-Base · Hugging Face |url=https://huggingface.co/nvidia/Nemotron-4-340B-Base |access-date=2024-06-15 |website=huggingface.co |archive-date=2024-06-15 |archive-url=https://web.archive.org/web/20240615010323/https://huggingface.co/nvidia/Nemotron-4-340B-Base |url-status=live }}{{Cite web |title=Nemotron-4 340B {{!}} Research |url=https://research.nvidia.com/publication/2024-06_nemotron-4-340b |access-date=2024-06-15 |website=research.nvidia.com |archive-date=2024-06-15 |archive-url=https://web.archive.org/web/20240615010323/https://research.nvidia.com/publication/2024-06_nemotron-4-340b |url-status=live }}

|-
|Llama 3.1

|{{dts|2024-07}}

|Meta AI

|405

|15.6T tokens

|{{sort|440000|440,000}}

| {{partial success|Llama 3 license}}

|405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta]{{Cite web |title=llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models |url=https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md |access-date=2024-07-23 |website=GitHub |language=en |archive-date=2024-07-23 |archive-url=https://web.archive.org/web/20240723151851/https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md |url-status=live }}

|-
|DeepSeek-V3

|{{dts|2024-12}}

|DeepSeek

|671

|14.8T tokens

|{{sort|56000|56,000}}

|{{Yes|MIT}}

|2.788M hours on H800 GPUs.{{Citation |title=deepseek-ai/DeepSeek-V3 |date=2024-12-26 |url=https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file |access-date=2024-12-26 |publisher=DeepSeek}} Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.{{cite web |last1=Feng |first1=Coco |title=DeepSeek wows coders with more powerful open-source V3 model |url=https://www.scmp.com/tech/big-tech/article/3303798/deepseeks-upgraded-foundational-model-excels-coding-and-maths |website=South China Morning Post |access-date=6 April 2025 |language=en |date=25 March 2025}}

|-
|Amazon Nova

|{{dts|2024-12}}

|Amazon

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Includes three models, Nova Micro, Nova Lite, and Nova Pro{{Citation |title=Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards3 |date=2024-12-27 |url=https://docs.aws.amazon.com/ai/responsible-ai/nova-micro-lite-pro/overview.html |access-date=2024-12-27 |publisher=Amazon}}

|-
|DeepSeek-R1

|{{dts|2025-01}}

|DeepSeek

|671

|{{n/a|Not applicable}}

| {{Unknown}}

|{{Yes|MIT}}

|No pretraining. Reinforcement-learned upon V3-Base.{{Citation |title=deepseek-ai/DeepSeek-R1 |date=2025-01-21 |url=https://github.com/deepseek-ai/DeepSeek-R1 |access-date=2025-01-21 |publisher=DeepSeek}}{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}

|-
|Qwen2.5

|{{dts|2025-01}}

|Alibaba

|72

|18T tokens

| {{Unknown}}

|{{partial success|Qwen License}}

|7 dense models, with parameter count from 0.5B to 72B. They also released 2 MoE variants.{{Citation |last1=Qwen |title=Qwen2.5 Technical Report |date=2025-01-03 |arxiv=2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}

|-
|MiniMax-Text-01

|{{dts|2025-01}}

|Minimax

|456

|4.7T tokens

| {{Unknown}}

|{{partial success|Minimax Model license}}

|{{Citation |title=MiniMax-AI/MiniMax-01 |date=2025-01-26 |url=https://github.com/MiniMax-AI/MiniMax-01?tab=readme-ov-file |access-date=2025-01-26 |publisher=MiniMax}}{{Citation |last1=MiniMax |title=MiniMax-01: Scaling Foundation Models with Lightning Attention |date=2025-01-14 |arxiv=2501.08313 |last2=Li |first2=Aonian |last3=Gong |first3=Bangwei |last4=Yang |first4=Bo |last5=Shan |first5=Boji |last6=Liu |first6=Chang |last7=Zhu |first7=Cheng |last8=Zhang |first8=Chunhao |last9=Guo |first9=Congchao}}

|-
|Gemini 2.0

|{{dts|2025-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Three models released: Flash, Flash-Lite and Pro{{cite web |last1=Kavukcuoglu |first1=Koray |title=Gemini 2.0 is now available to everyone |url=https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/ |website=Google |date=5 February 2025 |access-date=6 February 2025}}{{cite web |title=Gemini 2.0: Flash, Flash-Lite and Pro |url=https://developers.googleblog.com/en/gemini-2-family-expands/ |website=Google for Developers |access-date=6 February 2025}}{{cite news |last1=Franzen |first1=Carl |title=Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search |url=https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/ |access-date=6 February 2025 |work=VentureBeat |date=5 February 2025}}

|-
|Mistral Large

|{{dts|2024-11}}

|Mistral AI

|123

| {{Unknown}}

| {{Unknown}}

|{{partial success|Mistral Research License}}

|Upgraded over time. The latest version is 24.11.{{cite web |url=https://docs.mistral.ai/getting-started/models/models_overview/ |title=Models Overview |website=mistral.ai |access-date=2025-03-03}}

|-
|Pixtral

|{{dts|2024-11}}

|Mistral AI

|123

| {{Unknown}}

| {{Unknown}}

|{{partial success|Mistral Research License}}

|Multimodal. There is also a 12B version, which is under the Apache 2.0 license.

|-
|Grok 3

|{{dts|2025-02}}

|xAI

| {{Unknown}}

| {{Unknown}}

| {{Unknown}},
estimated 5,800,000.

|{{proprietary}}

|Training cost claimed "10x the compute of previous state-of-the-art models".{{Cite web |title=Grok 3 Beta — The Age of Reasoning Agents |url=https://x.ai/blog/grok-3 |access-date=2025-02-22 |website=x.ai |language=en}}

|-
|Llama 4

|{{dts|2025-04-05}}

|Meta AI

|{{sort|400|400}}

|{{sort|40000000000000|40T tokens}}

|

|{{partial success|Llama 4 license}}

|{{Cite web |date=2025-04-05 |title=meta-llama/Llama-4-Maverick-17B-128E · Hugging Face |url=https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E |access-date=2025-04-06 |website=huggingface.co}}{{Cite web |title=The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation |url=https://ai.meta.com/blog/llama-4-multimodal-intelligence/ |archive-url=http://web.archive.org/web/20250405185132/https://ai.meta.com/blog/llama-4-multimodal-intelligence/ |archive-date=2025-04-05 |access-date=2025-04-05 |website=ai.meta.com |language=en}}

|-
|Qwen3

|{{dts|2025-04}}

|Alibaba Cloud

|235

|{{sort|36000000000000|36T tokens}}

| {{Unknown}}

|{{yes|Apache 2.0}}

|Multiple sizes, the smallest being 0.6B.{{Cite web |last=Team |first=Qwen |date=2025-04-29 |title=Qwen3: Think Deeper, Act Faster |url=https://qwenlm.github.io/blog/qwen3/ |access-date=2025-04-29 |website=Qwen |language=en}}

|}

== See also ==

== Notes ==
{{notelist}}

== References ==

{{reflist}}

{{Natural Language Processing}}

{{Portal bar|Language}}

{{Authority control}}

[[Category:Software comparisons]]
[[Category:Large language models]]