List of large language models

{{Short description|none}}

A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text.

This page lists notable large language models.

== List ==

For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64×10<sup>19</sup> FLOP. For model families released in multiple sizes, only the training cost of the largest model is listed.
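The figures in this column can be sanity-checked with a short calculation. The sketch below is illustrative only: the factor of roughly 6 FLOP per parameter per training token is a common rule of thumb for dense transformers (an assumption, not a value taken from the cited sources), and the function names and example numbers are chosen purely for demonstration.

<syntaxhighlight lang="python">
# Illustrative sketch: convert FLOP counts to petaFLOP-days and apply the
# common "~6 FLOP per parameter per token" rule of thumb for dense transformers.
PFLOP_DAY_IN_FLOP = 1e15 * 86_400  # 1 petaFLOP/s sustained for one day = 8.64e19 FLOP

def flop_to_pflop_days(flop: float) -> float:
    """Convert a raw FLOP count into petaFLOP-days."""
    return flop / PFLOP_DAY_IN_FLOP

def approx_training_flop(parameters: float, tokens: float) -> float:
    """Rough dense-transformer training compute estimate (assumed 6 * N * D)."""
    return 6 * parameters * tokens

# Example with GPT-3-like numbers: 175 billion parameters, 300 billion tokens.
flop = approx_training_flop(175e9, 300e9)   # ~3.15e23 FLOP
print(round(flop_to_pflop_days(flop)))      # ~3646 petaFLOP-days
</syntaxhighlight>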

{{table alignment}}

{{sort-under}}

class="wikitable sortable sort-under col2right col4right col5right col6right" style="font-size:smaller"
NameRelease date{{efn|This is the date that documentation describing the model's architecture was first released.}}DeveloperNumber of parameters (billion) {{efn|In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.}}Corpus size

!Training cost (petaFLOP-day)

License{{efn|This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.}}Notes
|-
|Attention Is All You Need

|{{dts|2017-06}}

|Vaswani et al. at Google

|0.213

|36 million English-French sentence pairs

|0.09{{Cite web |date=2022-06-09 |title=AI and compute |url=https://openai.com/index/ai-and-compute/ |access-date=2025-04-24 |website=openai.com |language=en-US}}

|

|Trained for 0.3M steps on 8 NVIDIA P100 GPUs.

GPT-1{{dts|2018-06}}OpenAI{{sort|0.117|0.117}}| 1{{cite web |date=June 11, 2018 |title=Improving language understanding with unsupervised learning |url=https://openai.com/research/language-unsupervised |url-status=live |archive-url=https://web.archive.org/web/20230318210736/https://openai.com/research/language-unsupervised |archive-date=2023-03-18 |access-date=2023-03-18 |website=openai.com }}{{yes|MIT}}{{cite web|work=GitHub|title=finetune-transformer-lm|url=https://github.com/openai/finetune-transformer-lm|access-date=2 January 2024|archive-date=19 May 2023|archive-url=https://web.archive.org/web/20230519062127/https://github.com/openai/finetune-transformer-lm|url-status=live}}

| First GPT model, decoder-only transformer. Trained for 30 days on 8 P600 GPUs.

BERT{{dts|2018-10}}Google{{sort|0.340|0.340}}{{cite arXiv |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |eprint=1810.04805v2|class=cs.CL }}{{sort|3300000000|3.3 billion}} words

|{{sort|9|9}}{{Cite web |last=Prickett |first=Nicole Hemsoth |date=2021-08-24 |title=Cerebras Shifts Architecture To Meet Massive AI/ML Models |url=https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ |access-date=2023-06-20 |website=The Next Platform |archive-date=2023-06-20 |archive-url=https://web.archive.org/web/20230620151619/https://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/ |url-status=live }}

{{yes|Apache 2.0}}{{Cite web|url=https://github.com/google-research/bert|title=BERT|date=March 13, 2023|via=GitHub|access-date=March 13, 2023|archive-date=January 13, 2021|archive-url=https://web.archive.org/web/20210113211317/https://github.com/google-research/bert|url-status=live}}

| An early and influential language model.{{cite journal |last=Manning |first=Christopher D. |author-link=Christopher D. Manning |year=2022 |title=Human Language Understanding & Reasoning |url=https://www.amacad.org/publication/human-language-understanding-reasoning |journal=Daedalus |volume=151 |issue=2 |pages=127–138 |doi=10.1162/daed_a_01905 |s2cid=248377870 |doi-access=free |access-date=2023-03-09 |archive-date=2023-11-17 |archive-url=https://web.archive.org/web/20231117205531/https://www.amacad.org/publication/human-language-understanding-reasoning |url-status=live }}Encoder-only and thus not built to be prompted or generative.{{cite arXiv |last1=Patel |first1=Ajay |last2=Li |first2=Bryan |last3=Rasooli |first3=Mohammad Sadegh |last4=Constant |first4=Noah |last5=Raffel |first5=Colin |last6=Callison-Burch |first6=Chris |title=Bidirectional Language Models Are Also Few-shot Learners |date=2022 |class=cs.LG |eprint=2209.14500}} Training took 4 days on 64 TPUv2 chips.{{cite arXiv |eprint=1810.04805v2 |class=cs.CL |first1=Jacob |last1=Devlin |first2=Ming-Wei |last2=Chang |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina}}

T5

|{{dts|2019-10}}

|Google

|{{sort|11|11}}{{Cite journal |last1=Raffel |first1=Colin |last2=Shazeer |first2=Noam |last3=Roberts |first3=Adam |last4=Lee |first4=Katherine |last5=Narang |first5=Sharan |last6=Matena |first6=Michael |last7=Zhou |first7=Yanqi |last8=Li |first8=Wei |last9=Liu |first9=Peter J. |date=2020 |title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |url=http://jmlr.org/papers/v21/20-074.html |journal=Journal of Machine Learning Research |volume=21 |issue=140 |pages=1–67 |arxiv=1910.10683 |issn=1533-7928}}

|34 billion tokens

|

| {{yes|Apache 2.0}}{{Citation |title=google-research/text-to-text-transfer-transformer |date=2024-04-02 |url=https://github.com/google-research/text-to-text-transfer-transformer |access-date=2024-04-04 |publisher=Google Research |archive-date=2024-03-29 |archive-url=https://web.archive.org/web/20240329112957/https://github.com/google-research/text-to-text-transfer-transformer |url-status=live }}

|Base model for many Google projects, such as Imagen.{{Cite web |title=Imagen: Text-to-Image Diffusion Models |url=https://imagen.research.google/ |access-date=2024-04-04 |website=imagen.research.google |archive-date=2024-03-27 |archive-url=https://web.archive.org/web/20240327201713/https://imagen.research.google/ |url-status=live }}

XLNet{{dts|2019-06}}Google{{sort|0.340|0.340}}{{Cite web |title=Pretrained models — transformers 2.0.0 documentation |url=https://huggingface.co/transformers/v2.0.0/pretrained_models.html |access-date=2024-08-05 |website=huggingface.co |archive-date=2024-08-05 |archive-url=https://web.archive.org/web/20240805032110/https://huggingface.co/transformers/v2.0.0/pretrained_models.html |url-status=live }}{{sort|3300000000|33}} billion words

| 330

{{yes|Apache 2.0}}{{cite web|work=GitHub|title=xlnet|url=https://github.com/zihangdai/xlnet/|access-date=2 January 2024|archive-date=2 January 2024|archive-url=https://web.archive.org/web/20240102191842/https://github.com/zihangdai/xlnet/|url-status=live}}

| An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.{{cite arXiv |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Ruslan |last6=Le |first6=Quoc V. |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |date=2 January 2020 |class=cs.CL |eprint=1906.08237}}

GPT-2{{dts|2019-02}}OpenAI{{sort|1.5|1.5}}{{Cite web |url = https://openai.com/blog/gpt-2-1-5b-release/ |title = GPT-2: 1.5B Release |date = 2019-11-05 |website = OpenAI |language = en |access-date = 2019-11-14 |archive-date = 2019-11-14 |archive-url = https://web.archive.org/web/20191114074358/https://openai.com/blog/gpt-2-1-5b-release/ |url-status = live}}40GB{{cite web |title=Better language models and their implications |url=https://openai.com/research/better-language-models |website=openai.com |access-date=2023-03-13 |archive-date=2023-03-16 |archive-url=https://web.archive.org/web/20230316160730/https://openai.com/research/better-language-models |url-status=live }} (~{{sort|10000000000|10 billion}} tokens){{cite web |title=OpenAI's GPT-3 Language Model: A Technical Overview |url=https://lambdalabs.com/blog/demystifying-gpt-3 |website=lambdalabs.com |date=3 June 2020 |access-date=13 March 2023 |archive-date=27 March 2023 |archive-url=https://web.archive.org/web/20230327213811/https://lambdalabs.com/blog/demystifying-gpt-3 |url-status=live }}

| 28{{Cite web |title=openai-community/gpt2-xl · Hugging Face |url=https://huggingface.co/openai-community/gpt2-xl |access-date=2024-07-24 |website=huggingface.co |archive-date=2024-07-24 |archive-url=https://web.archive.org/web/20240724041702/https://huggingface.co/openai-community/gpt2-xl |url-status=live }}

{{yes|MIT}}{{cite web|work=GitHub|title=gpt-2|url=https://github.com/openai/gpt-2|access-date=13 March 2023|archive-date=11 March 2023|archive-url=https://web.archive.org/web/20230311154936/https://github.com/openai/gpt-2|url-status=live}}

| Trained on 32 TPUv3 chips for 1 week.

GPT-3{{dts|2020-05}}OpenAI{{sort|175|175}}{{cite web |last=Wiggers |first=Kyle |date=28 April 2022 |title=The emerging types of language models and why they matter |url=https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |work=TechCrunch |access-date=9 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316072443/https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |url-status=live }}{{sort|300000000000|300 billion}} tokens

|3640Table D.1 in {{Cite arXiv |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=May 28, 2020 |title=Language Models are Few-Shot Learners |eprint=2005.14165v4 |first16=Aditya |last16=Ramesh |first17=Daniel M. |last17=Ziegler |first18=Jeffrey |last18=Wu |first19=Clemens |last19=Winter |first20=Christopher |last20=Hesse |first21=Mark |last21=Chen |first22=Eric |last22=Sigler |first23=Mateusz |last23=Litwin |first24=Scott |last24=Gray |first25=Benjamin |last25=Chess |first26=Jack |last26=Clark |first27=Christopher |last27=Berner |first28=Sam |last28=McCandlish |first29=Alec |last29=Radford |first30=Ilya |last30=Sutskever |first31=Dario |last31=Amodei|class=cs.CL}}

|{{proprietary}}

| A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.{{Cite web |date=2022-11-30 |title=ChatGPT: Optimizing Language Models for Dialogue |url=https://openai.com/blog/chatgpt/ |access-date=2023-01-13 |website=OpenAI |archive-date=2022-11-30 |archive-url=https://web.archive.org/web/20221130180912/https://openai.com/blog/chatgpt/ |url-status=live}}

GPT-Neo{{dts|2021-03}}EleutherAI{{sort|2.7|2.7}}{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023|via=GitHub|access-date=March 12, 2023|archive-date=March 12, 2023|archive-url=https://web.archive.org/web/20230312225202/https://github.com/EleutherAI/gpt-neo|url-status=live}}825 GiB{{cite arXiv |last1=Gao |first1=Leo |last2=Biderman |first2=Stella |last3=Black |first3=Sid |last4=Golding |first4=Laurence |last5=Hoppe |first5=Travis |last6=Foster |first6=Charles |last7=Phang |first7=Jason |last8=He |first8=Horace |last9=Thite |first9=Anish |last10=Nabeshima |first10=Noa |last11=Presser |first11=Shawn |last12=Leahy |first12=Connor |title=The Pile: An 800GB Dataset of Diverse Text for Language Modeling |eprint=2101.00027|date=31 December 2020 |class=cs.CL}}

|

|{{yes|MIT}}

| The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.{{cite web |work=VentureBeat |last=Iyer |first=Abhishek |title=GPT-3's free alternative GPT-Neo is something to be excited about |date=15 May 2021 |url=https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |access-date=13 March 2023 |archive-date=9 March 2023 |archive-url=https://web.archive.org/web/20230309012717/https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |url-status=live}}

GPT-J{{dts|2021-06}}EleutherAI{{sort|6|6}}{{Cite web |title=GPT-J-6B: An Introduction to the Largest Open Source GPT Model {{!}} Forefront |url=https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |access-date=2023-02-28 |website=www.forefront.ai |archive-date=2023-03-09 |archive-url=https://web.archive.org/web/20230309205439/https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model |url-status=dead }}825 GiB

|200{{Cite arXiv |last1=Dey |first1=Nolan |last2=Gosal |first2=Gurpreet |last3=Zhiming |last4=Chen |last5=Khachane |first5=Hemant |last6=Marshall |first6=William |last7=Pathria |first7=Ribhu |last8=Tom |first8=Marvin |last9=Hestness |first9=Joel |date=2023-04-01 |title=Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster |class=cs.LG |eprint=2304.03208}}

|{{yes|Apache 2.0}}

| GPT-3-style language model

Megatron-Turing NLG{{dts|2021-10}} {{cite web |last1=Alvi |first1=Ali |last2=Kharya |first2=Paresh |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model |url=https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=Microsoft Research |date=11 October 2021 |access-date=13 March 2023 |archive-date=13 March 2023 |archive-url=https://web.archive.org/web/20230313180531/https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |url-status=live }}Microsoft and Nvidia{{sort|530|530}}{{Cite arXiv |last1=Smith |first1=Shaden |last2=Patwary |first2=Mostofa |last3=Norick |first3=Brandon |last4=LeGresley |first4=Patrick |last5=Rajbhandari |first5=Samyam |last6=Casper |first6=Jared |last7=Liu |first7=Zhun |last8=Prabhumoye |first8=Shrimai |last9=Zerveas |first9=George |last10=Korthikanti |first10=Vijay |last11=Zhang |first11=Elton |last12=Child |first12=Rewon |last13=Aminabadi |first13=Reza Yazdani |last14=Bernauer |first14=Julie |last15=Song |first15=Xia |date=2022-02-04 |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model |class=cs.CL |eprint=2201.11990}}{{sort|338600000000|338.6 billion}} tokens

| 38000

|{{no|Restricted web access}}

| Trained for 3 months on over 2000 A100 GPUs on the NVIDIA Selene Supercomputer, for over 3 million GPU-hours.{{Citation |last1=Rajbhandari |first1=Samyam |title=DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale |date=2022-07-21 |arxiv=2201.05596 |last2=Li |first2=Conglong |last3=Yao |first3=Zhewei |last4=Zhang |first4=Minjia |last5=Aminabadi |first5=Reza Yazdani |last6=Awan |first6=Ammar Ahmad |last7=Rasley |first7=Jeff |last8=He |first8=Yuxiong}}

Ernie 3.0 Titan{{dts|2021-12}}Baidu{{sort|260|260}}{{Cite arXiv|title=ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation|first1=Shuohuan|last1=Wang|first2=Yu|last2=Sun|first3=Yang|last3=Xiang|first4=Zhihua|last4=Wu|first5=Siyu|last5=Ding|first6=Weibao|last6=Gong|first7=Shikun|last7=Feng|first8=Junyuan|last8=Shang|first9=Yanbin|last9=Zhao|first10=Chao|last10=Pang|first11=Jiaxiang|last11=Liu|first12=Xuyi|last12=Chen|first13=Yuxiang|last13=Lu|first14=Weixin|last14=Liu|first15=Xi|last15=Wang|first16=Yangfan|last16=Bai|first17=Qiuliang|last17=Chen|first18=Li|last18=Zhao|first19=Shiyong|last19=Li|first20=Peng|last20=Sun|first21=Dianhai|last21=Yu|first22=Yanjun|last22=Ma|first23=Hao|last23=Tian|first24=Hua|last24=Wu|first25=Tian|last25=Wu|first26=Wei|last26=Zeng|first27=Ge|last27=Li|first28=Wen|last28=Gao|first29=Haifeng|last29=Wang|date=December 23, 2021|class=cs.CL |eprint=2112.12731}}4 Tb

|

|{{proprietary}}

| Chinese-language LLM. Ernie Bot is based on this model.

Claude{{cite web |title=Product |url=https://www.anthropic.com/product |website=Anthropic |access-date=14 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316145444/https://www.anthropic.com/product |url-status=live }}{{dts|2021-12}}Anthropic{{sort|52|52}}{{cite arXiv |last1=Askell |first1=Amanda |last2=Bai |first2=Yuntao |last3=Chen |first3=Anna |last4=Drain |first4=Dawn |last5=Ganguli |first5=Deep |last6=Henighan |first6=Tom |last7=Jones |first7=Andy |last8=Joseph |first8=Nicholas |last9=Mann |first9=Ben |last10=DasSarma |first10=Nova |last11=Elhage |first11=Nelson |last12=Hatfield-Dodds |first12=Zac |last13=Hernandez |first13=Danny |last14=Kernion |first14=Jackson |last15=Ndousse |first15=Kamal |last16=Olsson |first16=Catherine |last17=Amodei |first17=Dario |last18=Brown |first18=Tom |last19=Clark |first19=Jack |last20=McCandlish |first20=Sam |last21=Olah |first21=Chris |last22=Kaplan |first22=Jared |display-authors=3 |title=A General Language Assistant as a Laboratory for Alignment |eprint=2112.00861 |date=9 December 2021 |class=cs.CL}}{{sort|400000000000|400 billion}} tokens

|

|{{partial success|beta}}

| Fine-tuned for desirable behavior in conversations.{{cite arXiv |last1=Bai |first1=Yuntao |last2=Kadavath |first2=Saurav |last3=Kundu |first3=Sandipan |last4=Askell |first4=Amanda |last5=Kernion |first5=Jackson |last6=Jones |first6=Andy |last7=Chen |first7=Anna |last8=Goldie |first8=Anna |last9=Mirhoseini |first9=Azalia |last10=McKinnon |first10=Cameron |last11=Chen |first11=Carol |last12=Olsson |first12=Catherine |last13=Olah |first13=Christopher |last14=Hernandez |first14=Danny |last15=Drain |first15=Dawn |last16=Ganguli |first16=Deep |last17=Li |first17=Dustin |last18=Tran-Johnson |first18=Eli |last19=Perez |first19=Ethan |last20=Kerr |first20=Jamie |last21=Mueller |first21=Jared |last22=Ladish |first22=Jeffrey |last23=Landau |first23=Joshua |last24=Ndousse |first24=Kamal |last25=Lukosuite |first25=Kamile |last26=Lovitt |first26=Liane |last27=Sellitto |first27=Michael |last28=Elhage |first28=Nelson |last29=Schiefer |first29=Nicholas |last30=Mercado |first30=Noemi |last31=DasSarma |first31=Nova |last32=Lasenby |first32=Robert |last33=Larson |first33=Robin |last34=Ringer |first34=Sam |last35=Johnston |first35=Scott |last36=Kravec |first36=Shauna |last37=Showk |first37=Sheer El |last38=Fort |first38=Stanislav |last39=Lanham |first39=Tamera |last40=Telleen-Lawton |first40=Timothy |last41=Conerly |first41=Tom |last42=Henighan |first42=Tom |last43=Hume |first43=Tristan |last44=Bowman |first44=Samuel R. |last45=Hatfield-Dodds |first45=Zac |last46=Mann |first46=Ben |last47=Amodei |first47=Dario |last48=Joseph |first48=Nicholas |last49=McCandlish |first49=Sam |last50=Brown |first50=Tom |last51=Kaplan |first51=Jared |display-authors=3 |title=Constitutional AI: Harmlessness from AI Feedback |eprint=2212.08073 |date=15 December 2022 |class=cs.CL}}

GLaM (Generalist Language Model){{dts|2021-12}}Google{{sort|1200|1200}}{{sort|1600000000000|1.6 trillion}} tokens{{Cite web |last1=Dai |first1=Andrew M |last2=Du |first2=Nan |date=December 9, 2021 |title=More Efficient In-Context Learning with GLaM |url=https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |access-date=2023-03-09 |website=ai.googleblog.com |archive-date=2023-03-12 |archive-url=https://web.archive.org/web/20230312072042/https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html |url-status=live}}

| 5600

|{{proprietary}}

| Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.

Gopher{{dts|2021-12}}DeepMind{{sort|280|280}}{{cite web |title=Language modelling at scale: Gopher, ethical considerations, and retrieval |url=https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval |website=www.deepmind.com |date=8 December 2021 |access-date=20 March 2023 |archive-date=20 March 2023 |archive-url=https://web.archive.org/web/20230320082323/https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval |url-status=live }}{{sort|300000000000|300 billion}} tokens

|5833Table 20 and page 66 of [https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf PaLM: Scaling Language Modeling with Pathways] {{Webarchive|url=https://web.archive.org/web/20230610040050/https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf |date=2023-06-10 }}

|{{proprietary}}

| Later developed into the Chinchilla model.

LaMDA (Language Models for Dialog Applications){{dts|2022-01}}Google{{sort|137|137}}1.56T words,{{Cite web |last1=Cheng |first1=Heng-Tze |last2=Thoppilan |first2=Romal |date=January 21, 2022 |title=LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything |url=https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html |access-date=2023-03-09 |website=ai.googleblog.com |archive-date=2022-03-25 |archive-url=https://web.archive.org/web/20220325014118/https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html |url-status=live}} {{sort|168000000000|168 billion}} tokens

|4110{{Cite arXiv |last1=Thoppilan |first1=Romal |last2=De Freitas |first2=Daniel |last3=Hall |first3=Jamie |last4=Shazeer |first4=Noam |last5=Kulshreshtha |first5=Apoorv |last6=Cheng |first6=Heng-Tze |last7=Jin |first7=Alicia |last8=Bos |first8=Taylor |last9=Baker |first9=Leslie |last10=Du |first10=Yu |last11=Li |first11=YaGuang |last12=Lee |first12=Hongrae |last13=Zheng |first13=Huaixiu Steven |last14=Ghafouri |first14=Amin |last15=Menegali |first15=Marcelo |date=2022-01-01 |title=LaMDA: Language Models for Dialog Applications |class=cs.CL |eprint=2201.08239}}

|{{proprietary}}

| Specialized for response generation in conversations.

GPT-NeoX{{dts|2022-02}}EleutherAI{{sort|20|20}}{{cite conference |title=GPT-NeoX-20B: An Open-Source Autoregressive Language Model |conference=Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models |date=2022-05-01 |last1=Black |first1=Sidney |last2=Biderman |first2=Stella |last3=Hallahan |first3=Eric |display-authors=etal |volume=Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models |pages=95–136 |url=https://aclanthology.org/2022.bigscience-1.9/ |access-date=2022-12-19 |archive-date=2022-12-10 |archive-url=https://web.archive.org/web/20221210082456/https://aclanthology.org/2022.bigscience-1.9/ |url-status=live }}825 GiB

|740

|{{yes|Apache 2.0}}

| Based on the Megatron architecture.

Chinchilla{{dts|2022-03}}DeepMind{{sort|70|70}}{{sort|1400000000000|1.4 trillion}} tokens{{cite web |work=Deepmind Blog |title=An empirical analysis of compute-optimal large language model training |first1=Jordan |last1=Hoffmann |first2=Sebastian |last2=Borgeaud |first3=Arthur |last3=Mensch |first4=Laurent |last4=Sifre |date=12 April 2022 |url=https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training |access-date=9 March 2023 |archive-date=13 April 2022 |archive-url=https://web.archive.org/web/20220413014510/https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training |url-status=live}}{{cite arXiv |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |eprint=2203.15556 |date=29 March 2022 |class=cs.CL |display-authors=3}}

|6805

|{{proprietary}}

| Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law.

PaLM (Pathways Language Model){{dts|2022-04}}Google{{sort|540|540}}{{Cite web |last1=Narang |first1=Sharan |last2=Chowdhery |first2=Aakanksha |date=April 4, 2022 |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |access-date=2023-03-09 |website=ai.googleblog.com |language=en |archive-date=2022-04-04 |archive-url=https://web.archive.org/web/20220404161447/https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |url-status=live}}{{sort|768000000000|768 billion}} tokens

|{{sort|29250|29,250}}

|{{proprietary}}

| Trained for ~60 days on ~6000 TPU v4 chips. {{As of|2024|October}}, it is the largest dense Transformer published.

OPT (Open Pretrained Transformer){{dts|2022-05}}Meta{{sort|175|175}}{{cite web |title=Democratizing access to large-scale language models with OPT-175B |author1=Susan Zhang |author2=Mona Diab |author3=Luke Zettlemoyer |url=https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.facebook.com |access-date=2023-03-12 |archive-date=2023-03-12 |archive-url=https://web.archive.org/web/20230312231820/https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |url-status=live }}{{sort|180000000000|180 billion}} tokens{{cite arXiv |last1=Zhang |first1=Susan |last2=Roller |first2=Stephen |last3=Goyal |first3=Naman |last4=Artetxe |first4=Mikel |last5=Chen |first5=Moya |last6=Chen |first6=Shuohui |last7=Dewan |first7=Christopher |last8=Diab |first8=Mona |last9=Li |first9=Xian |last10=Lin |first10=Xi Victoria |last11=Mihaylov |first11=Todor |last12=Ott |first12=Myle |last13=Shleifer |first13=Sam |last14=Shuster |first14=Kurt |last15=Simig |first15=Daniel |last16=Koura |first16=Punit Singh |last17=Sridhar |first17=Anjali |last18=Wang |first18=Tianlu |last19=Zettlemoyer |first19=Luke |title=OPT: Open Pre-trained Transformer Language Models |eprint=2205.01068 |date=21 June 2022|class=cs.CL}}

|310

|{{partial success|Non-commercial research}}{{efn|The smaller models including 66B are publicly available, while the 175B model is available on request.}}

| GPT-3 architecture with some adaptations from Megatron. Uniquely, the training logbook written by the team was published.{{Cite web |title=metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq |url=https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles |access-date=2024-10-18 |website=GitHub |language=en}}

YaLM 100B{{dts|2022-06}}Yandex{{sort|100|100}}{{Citation |last1=Khrushchev |first1=Mikhail |title=YaLM 100B |date=2022-06-22 |url=https://github.com/yandex/YaLM-100B |access-date=2023-03-18 |last2=Vasilev |first2=Ruslan |last3=Petrov |first3=Alexey |last4=Zinov |first4=Nikolay |archive-date=2023-06-16 |archive-url=https://web.archive.org/web/20230616050056/https://github.com/yandex/YaLM-100B |url-status=live }}1.7TB|{{Yes|Apache 2.0}}English-Russian model based on Microsoft's Megatron-LM.
Minerva{{dts|2022-06}}Google{{sort|540|540}}38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server{{cite arXiv |last1=Lewkowycz |first1=Aitor |last2=Andreassen |first2=Anders |last3=Dohan |first3=David |last4=Dyer |first4=Ethan |last5=Michalewski |first5=Henryk |last6=Ramasesh |first6=Vinay |last7=Slone |first7=Ambrose |last8=Anil |first8=Cem |last9=Schlag |first9=Imanol |last10=Gutman-Solo |first10=Theo |last11=Wu |first11=Yuhuai |last12=Neyshabur |first12=Behnam |last13=Gur-Ari |first13=Guy |last14=Misra |first14=Vedant |title=Solving Quantitative Reasoning Problems with Language Models |date=30 June 2022 |class=cs.CL |eprint=2206.14858}}

|

|{{proprietary}}

| For solving "mathematical and scientific questions using step-by-step reasoning".{{cite web |title=Minerva: Solving Quantitative Reasoning Problems with Language Models |url=https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html |website=ai.googleblog.com |date=30 June 2022 |access-date=20 March 2023 }} Initialized from PaLM models, then finetuned on mathematical and scientific data.

BLOOM{{dts|2022-07}}Large collaboration led by Hugging Face{{sort|175|175}}{{cite journal |journal=Nature |last=Ananthaswamy |first=Anil |title=In AI, is bigger always better? |date=8 March 2023 |volume=615 |issue=7951 |pages=202–205 |doi=10.1038/d41586-023-00641-w |pmid=36890378 |bibcode=2023Natur.615..202A |s2cid=257380916 |url=https://www.nature.com/articles/d41586-023-00641-w |access-date=9 March 2023 |archive-date=16 March 2023 |archive-url=https://web.archive.org/web/20230316181013/https://www.nature.com/articles/d41586-023-00641-w |url-status=live }}{{sort|350000000000|350 billion}} tokens (1.6TB){{cite web |title=bigscience/bloom · Hugging Face |url=https://huggingface.co/bigscience/bloom |website=huggingface.co |access-date=2023-03-13 |archive-date=2023-04-12 |archive-url=https://web.archive.org/web/20230412002547/https://huggingface.co/bigscience/bloom |url-status=live }}

|

|{{partial success|Responsible AI}}

| Essentially GPT-3, but trained on a multilingual corpus (30% English, excluding programming languages).

Galactica{{dts|2022-11}}Meta{{sort|120|120}}{{sort|350000000000|106 billion}} tokens{{cite arXiv |last1=Taylor |first1=Ross |last2=Kardas |first2=Marcin |last3=Cucurull |first3=Guillem |last4=Scialom |first4=Thomas |last5=Hartshorn |first5=Anthony |last6=Saravia |first6=Elvis |last7=Poulton |first7=Andrew |last8=Kerkez |first8=Viktor |last9=Stojnic |first9=Robert |title=Galactica: A Large Language Model for Science |date=16 November 2022 |class=cs.CL |eprint=2211.09085}}

| {{Unknown}}

|{{partial success|CC-BY-NC-4.0}}

| Trained on scientific text and modalities.

AlexaTM (Teacher Models){{dts|2022-11}}Amazon{{sort|20|20}}{{cite web |title=20B-parameter Alexa model sets new marks in few-shot learning |url=https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |website=Amazon Science |date=2 August 2022 |access-date=12 March 2023 |archive-date=15 March 2023 |archive-url=https://web.archive.org/web/20230315190223/https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning |url-status=live }}{{sort|1300000000000|1.3 trillion}}{{cite arXiv |last1=Soltan |first1=Saleh |last2=Ananthakrishnan |first2=Shankar |last3=FitzGerald |first3=Jack |last4=Gupta |first4=Rahul |last5=Hamza |first5=Wael |last6=Khan |first6=Haidar |last7=Peris |first7=Charith |last8=Rawls |first8=Stephen |last9=Rosenbaum |first9=Andy |last10=Rumshisky |first10=Anna |last11=Prakash |first11=Chandana Satya |last12=Sridhar |first12=Mukund |last13=Triefenbach |first13=Fabian |last14=Verma |first14=Apurv |last15=Tur |first15=Gokhan |last16=Natarajan |first16=Prem |display-authors=3|title=AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model |eprint=2208.01448 |date=3 August 2022|class=cs.CL}}

|

{{proprietary}}{{cite web |title=AlexaTM 20B is now available in Amazon SageMaker JumpStart {{!}} AWS Machine Learning Blog |url=https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |website=aws.amazon.com |access-date=13 March 2023 |date=17 November 2022 |archive-date=13 March 2023 |archive-url=https://web.archive.org/web/20230313163933/https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |url-status=live }}

| Bidirectional sequence-to-sequence architecture.

LLaMA (Large Language Model Meta AI){{dts|2023-02}}Meta AI{{sort|65|65}}{{sort|1400000000000|1.4 trillion}}{{cite web |work=Meta AI |title=Introducing LLaMA: A foundational, 65-billion-parameter large language model |date=24 February 2023 |url=https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ |access-date=9 March 2023 |archive-date=3 March 2023 |archive-url=https://web.archive.org/web/20230303112302/https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ |url-status=live}}

|6300{{Cite web |title=The Falcon has landed in the Hugging Face ecosystem |url=https://huggingface.co/blog/falcon |access-date=2023-06-20 |website=huggingface.co |archive-date=2023-06-20 |archive-url=https://web.archive.org/web/20230620002832/https://huggingface.co/blog/falcon |url-status=live }}

|{{partial success|Non-commercial research}}{{efn|Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.}}

| Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.

GPT-4{{dts|2023-03}}OpenAI{{Unknown}}{{efn|As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."{{Cite web |date=2023 |title=GPT-4 Technical Report |url=https://cdn.openai.com/papers/gpt-4.pdf |website=OpenAI |access-date=March 14, 2023 |archive-date=March 14, 2023 |archive-url=https://web.archive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf |url-status=live}} }}
(According to rumors: 1760){{Cite web |last=Schreiner |first=Maximilian |date=2023-07-11 |title=GPT-4 architecture, datasets, costs and more leaked |url=https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ |access-date=2024-07-26 |website=THE DECODER |language=en-US |archive-date=2023-07-12 |archive-url=https://web.archive.org/web/20230712123915/https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ |url-status=live }}

| {{Unknown}}

| {{Unknown}},
estimated 230,000.

|{{proprietary}}

| Available for ChatGPT Plus users and used in several products.

Chameleon{{dts|2024-06}}Meta AI{{sort|34|34}}{{cite news |last1=Dickson |first1=Ben |title=Meta introduces Chameleon, a state-of-the-art multimodal model |url=https://venturebeat.com/ai/meta-introduces-chameleon-a-state-of-the-art-multimodal-model/ |work=VentureBeat |date=22 May 2024}}{{sort|4400000000000|4.4 trillion}}
|-
|Cerebras-GPT

|{{dts|2023-03}}

|Cerebras

|{{sort|13|13}}{{Cite web|url=https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/|title=Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models|first=Nolan|last=Dey|date=March 28, 2023|website=Cerebras|access-date=March 28, 2023|archive-date=March 28, 2023|archive-url=https://web.archive.org/web/20230328213339/https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/|url-status=live}}

|

|270

|{{yes|Apache 2.0}}

| Trained with Chinchilla formula.

Falcon{{dts|2023-03}}Technology Innovation Institute{{sort|40|40}}{{cite web |title=Abu Dhabi-based TII launches its own version of ChatGPT |url=https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ |website=tii.ae |access-date=2023-04-03 |archive-date=2023-04-03 |archive-url=https://web.archive.org/web/20230403021729/https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgpt/ |url-status=live }}1 trillion tokens, from RefinedWeb (filtered web text corpus){{Cite arXiv |last1=Penedo |first1=Guilherme |last2=Malartic |first2=Quentin |last3=Hesslow |first3=Daniel |last4=Cojocaru |first4=Ruxandra |last5=Cappelli |first5=Alessandro |last6=Alobeidli |first6=Hamza |last7=Pannier |first7=Baptiste |last8=Almazrouei |first8=Ebtesam |last9=Launay |first9=Julien |date=2023-06-01 |title=The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only |class=cs.CL |eprint=2306.01116}} plus some "curated corpora".{{Cite web |date=2023-06-09 |title=tiiuae/falcon-40b · Hugging Face |url=https://huggingface.co/tiiuae/falcon-40b |access-date=2023-06-20 |website=huggingface.co}}

|2800

{{yes|Apache 2.0}}[https://www.businesswire.com/news/home/20230531005608/en/UAE's-Falcon-40B-World's-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free] {{Webarchive|url=https://web.archive.org/web/20240208133040/https://www.businesswire.com/news/home/20230531005608/en/UAE%27s-Falcon-40B-World%27s-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free |date=2024-02-08 }}, 31 May 2023

|

BloombergGPT{{dts|2023-03}}Bloomberg L.P.{{sort|50|50}}363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets{{Cite arXiv|title=BloombergGPT: A Large Language Model for Finance|first1=Shijie|last1=Wu|first2=Ozan|last2=Irsoy|first3=Steven|last3=Lu|first4=Vadim|last4=Dabravolski|first5=Mark|last5=Dredze|first6=Sebastian|last6=Gehrmann|first7=Prabhanjan|last7=Kambadur|first8=David|last8=Rosenberg|first9=Gideon|last9=Mann|date=March 30, 2023|class=cs.LG |eprint=2303.17564}}

|

|{{proprietary}}

| Trained on financial data from proprietary sources, for financial tasks.

PanGu-Σ{{dts|2023-03}}Huawei{{sort|1085|1085}}329 billion tokens{{Cite arXiv|title=PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing|first1=Xiaozhe|last1=Ren|first2=Pingyi|last2=Zhou|first3=Xinfan|last3=Meng|first4=Xinjing|last4=Huang|first5=Yadao|last5=Wang|first6=Weichao|last6=Wang|first7=Pengfei|last7=Li|first8=Xiaoda|last8=Zhang|first9=Alexander|last9=Podolskiy|first10=Grigory|last10=Arshinov|first11=Andrey|last11=Bout|first12=Irina|last12=Piontkovskaya|first13=Jiansheng|last13=Wei|first14=Xin|last14=Jiang|first15=Teng|last15=Su|first16=Qun|last16=Liu|first17=Jun|last17=Yao|date=March 19, 2023|class=cs.CL |eprint=2303.10845}}

|

|{{proprietary}}

|

OpenAssistant{{Cite arXiv |last1=Köpf |first1=Andreas |last2=Kilcher |first2=Yannic |last3=von Rütte |first3=Dimitri |last4=Anagnostidis |first4=Sotiris |last5=Tam |first5=Zhi-Rui |last6=Stevens |first6=Keith |last7=Barhoum |first7=Abdullah |last8=Duc |first8=Nguyen Minh |last9=Stanley |first9=Oliver |last10=Nagyfi |first10=Richárd |last11=ES |first11=Shahul |last12=Suri |first12=Sameer |last13=Glushkov |first13=David |last14=Dantuluri |first14=Arnav |last15=Maguire |first15=Andrew |date=2023-04-14 |title=OpenAssistant Conversations – Democratizing Large Language Model Alignment |class=cs.CL |eprint=2304.07327}}{{dts|2023-03}}LAION{{sort|17|17}}1.5 trillion tokens

|

|{{yes|Apache 2.0}}

| Trained on crowdsourced open data.

Jurassic-2{{Cite web |last=Wrobel |first=Sharon |title=Tel Aviv startup rolls out new advanced AI language model to rival OpenAI |url=https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ |access-date=2023-07-24 |website=www.timesofisrael.com |archive-date=2023-07-24 |archive-url=https://web.archive.org/web/20230724191823/https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/ |url-status=live }}

|{{dts|2023-03}}

|AI21 Labs

| {{Unknown}}

| {{Unknown}}

|

|{{proprietary}}

|Multilingual{{Cite web |last=Wiggers |first=Kyle |date=2023-04-13 |title=With Bedrock, Amazon enters the generative AI race |url=https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ |access-date=2023-07-24 |website=TechCrunch |archive-date=2023-07-24 |archive-url=https://web.archive.org/web/20230724102458/https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/ |url-status=live }}

PaLM 2 (Pathways Language Model 2){{dts|2023-05}}Google{{sort|340|340}}{{cite web |last=Elias |first=Jennifer |url=https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html |title=Google's newest A.I. model uses nearly five times more text data for training than its predecessor |work=CNBC |date=16 May 2023 |access-date=18 May 2023 |archive-date=16 May 2023 |archive-url=https://web.archive.org/web/20230516225326/https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html |url-status=live }}{{sort|3600000000000|3.6 trillion}} tokens

|{{sort|85000|85,000}}

|{{proprietary}}

| Was used in Bard chatbot.{{Cite web|url=https://blog.google/technology/ai/google-palm-2-ai-large-language-model/|title=Introducing PaLM 2|date=May 10, 2023|website=Google|access-date=May 18, 2023|archive-date=May 18, 2023|archive-url=https://web.archive.org/web/20230518213209/https://blog.google/technology/ai/google-palm-2-ai-large-language-model/|url-status=live}}

Llama 2{{dts|2023-07}}Meta AI{{sort|70|70}}{{Cite web | url = https://ai.meta.com/llama/ | title = Introducing Llama 2: The Next Generation of Our Open Source Large Language Model | access-date = 2023-07-19 | website = Meta AI | date = 2023 | archive-date = 2024-01-05 | archive-url = https://web.archive.org/web/20240105234629/https://ai.meta.com/llama/ | url-status = live }}{{sort|2000000000000|2 trillion}} tokens

| {{sort|21000|21,000}}

|{{partial success|Llama 2 license}}

| 1.7 million A100-hours.{{Cite web |title=llama/MODEL_CARD.md at main · meta-llama/llama |url=https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md |access-date=2024-05-28 |website=GitHub |archive-date=2024-05-28 |archive-url=https://web.archive.org/web/20240528090541/https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md |url-status=live }}

|-
|Claude 2

|{{dts|2023-07}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in Claude chatbot.{{cite web |title=Claude 2 |url=https://www.anthropic.com/index/claude-2 |website=anthropic.com |access-date=12 December 2023 |archive-date=15 December 2023 |archive-url=https://web.archive.org/web/20231215212208/https://www.anthropic.com/index/claude-2 |url-status=live }}

|-
|Granite 13b

|{{dts|2023-07}}

|IBM

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in IBM Watsonx.{{Cite web |last=Nirmal |first=Dinesh |date=2023-09-07 |title=Building AI for business: IBM's Granite foundation models |url=https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models |access-date=2024-08-11 |website=IBM Blog |language=en-US |archive-date=2024-07-22 |archive-url=https://web.archive.org/web/20240722083855/https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models/ |url-status=live }}

Mistral 7B{{dts|2023-09}}Mistral AI{{sort|7.3|7.3}}{{Cite web | url = https://mistral.ai/news/announcing-mistral-7b/ | title = Announcing Mistral 7B | access-date = 2023-10-06 | website = Mistral | date = 2023 | archive-date = 2024-01-06 | archive-url = https://web.archive.org/web/20240106051047/https://mistral.ai/news/announcing-mistral-7b/ | url-status = live }}{{Unknown}}

|

|{{yes|Apache 2.0}}

|

|-
|Claude 2.1

|{{dts|2023-11}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.{{cite web |title=Introducing Claude 2.1 |url=https://www.anthropic.com/index/claude-2-1 |website=anthropic.com |access-date=12 December 2023 |archive-date=15 December 2023 |archive-url=https://web.archive.org/web/20231215201726/https://www.anthropic.com/index/claude-2-1 |url-status=live }}

Grok 1{{Citation |title=xai-org/grok-1 |date=2024-03-19 |url=https://github.com/xai-org/grok-1 |access-date=2024-03-19 |publisher=xai-org |archive-date=2024-05-28 |archive-url=https://web.archive.org/web/20240528170731/https://github.com/xai-org/grok-1 |url-status=live }}

|{{dts|2023-11}}

|xAI

|314

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| Used in Grok chatbot. Grok 1 has a context length of 8,192 tokens and has access to X (Twitter).{{cite web |title=Grok-1 model card |url=https://x.ai/model-card/ |website=x.ai |access-date=12 December 2023}}

|-
|Gemini 1.0

|{{dts|2023-12}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Multimodal model, comes in three sizes. Used in the chatbot of the same name.{{cite web |title=Gemini – Google DeepMind |url=https://deepmind.google/technologies/gemini/#capabilities |website=deepmind.google |access-date=12 December 2023 |archive-date=8 December 2023 |archive-url=https://web.archive.org/web/20231208015607/https://deepmind.google/technologies/gemini/#capabilities |url-status=live }}

|-
|Mixtral 8x7B

|{{dts|2023-12}}

|Mistral AI

|46.7

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.{{cite web |last1=Franzen |first1=Carl |title=Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance |url=https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ |website=VentureBeat |access-date=12 December 2023 |date=11 December 2023 |archive-date=11 December 2023 |archive-url=https://web.archive.org/web/20231211213640/https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/ |url-status=live }} Mixture of experts model, with 12.9 billion parameters activated per token.{{cite web |date=11 December 2023 |title=Mixtral of experts |url=https://mistral.ai/news/mixtral-of-experts/ |access-date=12 December 2023 |website=mistral.ai |archive-date=13 February 2024 |archive-url=https://web.archive.org/web/20240213104049/https://mistral.ai/news/mixtral-of-experts/ |url-status=live }}

|-
|Mixtral 8x22B

|{{dts|2024-04}}

|Mistral AI

|141

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

| {{Cite web |last=AI |first=Mistral |date=2024-04-17 |title=Cheaper, Better, Faster, Stronger |url=https://mistral.ai/news/mixtral-8x22b/ |access-date=2024-05-05 |website=mistral.ai |archive-date=2024-05-05 |archive-url=https://web.archive.org/web/20240505023828/https://mistral.ai/news/mixtral-8x22b/ |url-status=live }}

|-
|DeepSeek-LLM

|{{DTS|2023-11-29}}

|DeepSeek

|67

|2T tokens{{Citation |last1=DeepSeek-AI |title=DeepSeek LLM: Scaling Open-Source Language Models with Longtermism |date=2024-01-05 |arxiv=2401.02954 |last2=Bi |first2=Xiao |last3=Chen |first3=Deli |last4=Chen |first4=Guanting |last5=Chen |first5=Shanhuang |last6=Dai |first6=Damai |last7=Deng |first7=Chengqi |last8=Ding |first8=Honghui |last9=Dong |first9=Kai}}{{Pg|location=table 2}}

|{{sort|12000|12,000}}

|{{partial success|DeepSeek License}}

|Trained on English and Chinese text. 1e24 FLOPs for 67B. 1e23 FLOPs for 7B{{Pg|location=figure 5}}

|-
|Phi-2

|{{dts|2023-12}}

|Microsoft

|2.7

|1.4T tokens

|419

|{{yes|MIT}}

| Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.{{cite web |last1=Hughes |first1=Alyssa |title=Phi-2: The surprising power of small language models |url=https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ |website=Microsoft Research |access-date=13 December 2023 |date=12 December 2023 |archive-date=12 December 2023 |archive-url=https://web.archive.org/web/20231212232647/https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/ |url-status=live }}

|-
|Gemini 1.5

|{{dts|2024-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.{{cite web |title=Our next-generation model: Gemini 1.5 |url=https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window |website=Google |access-date=16 February 2024 |date=15 February 2024 |quote=This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens. |archive-date=16 February 2024 |archive-url=https://web.archive.org/web/20240216003052/https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#context-window |url-status=live }}

|-
|Gemini Ultra

|{{dts|2024-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|
|-
|Gemma
|{{dts|2024-02}}
|Google DeepMind
|7
|6T tokens
| {{Unknown}}
|{{partial success|Gemma Terms of Use}}{{cite web|url=https://ai.google.dev/gemma/terms|title=Gemma|via=GitHub}}
|
|-
|Claude 3

|{{dts|2024-03}}

|Anthropic

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Includes three models, Haiku, Sonnet, and Opus.{{Cite web |title=Introducing the next generation of Claude |url=https://www.anthropic.com/news/claude-3-family |access-date=2024-03-04 |website=www.anthropic.com |archive-date=2024-03-04 |archive-url=https://web.archive.org/web/20240304143650/https://www.anthropic.com/news/claude-3-family |url-status=live }}

|-
|[https://rubiks.ai/nova/release/ Nova]

|{{dts|2024-10}}

|[https://rubiks.ai/ Rubik's AI]

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Comprised three models: Nova-Instant, Nova-Air, and Nova-Pro. The company later shifted to Sonus AI.

|-
|[https://sonus.ai/blog/sonus-1 Sonus]{{Cite web |title=Sonus AI |url=https://sonus.ai/ |access-date=2025-03-07 |website=sonus.ai |language=en-US}}

|{{dts|2025-01}}

|[https://rubiks.ai/ Rubik's AI]

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|

|-
|DBRX

|{{dts|2024-03}}

|Databricks and Mosaic ML

|{{sort|136|136}}

|12T Tokens

|

|{{Yes|Databricks Open Model License}}

|Training cost 10 million USD.

|-
|Fugaku-LLM

|{{dts|2024-05}}

|Fujitsu, Tokyo Institute of Technology, etc.

|{{sort|13|13}}

|380B Tokens

|

|

|The largest model trained using only CPUs, on the supercomputer Fugaku.{{Cite web |title=Fugaku-LLM/Fugaku-LLM-13B · Hugging Face |url=https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B |access-date=2024-05-17 |website=huggingface.co |archive-date=2024-05-17 |archive-url=https://web.archive.org/web/20240517135225/https://huggingface.co/Fugaku-LLM/Fugaku-LLM-13B |url-status=live }}

|-
|Phi-3

|{{dts|2024-04}}

|Microsoft

|14{{cite web|title=Phi-3|url=https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms|access-date=2024-04-28|website=azure.microsoft.com|date=23 April 2024|archive-date=2024-04-27|archive-url=https://web.archive.org/web/20240427043835/https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/|url-status=live}}

|4.8T Tokens

|

|{{Yes|MIT}}

|Microsoft markets them as "small language models".{{cite web|title=Phi-3 Model Documentation|url=https://huggingface.co/docs/transformers/main/en/model_doc/phi3|access-date=2024-04-28|website=huggingface.co|archive-date=2024-05-13|archive-url=https://web.archive.org/web/20240513141513/https://huggingface.co/docs/transformers/main/en/model_doc/phi3|url-status=live}}

|-
|Granite Code Models

|{{dts|2024-05}}

|IBM

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{yes|Apache 2.0}}

|

|-
|Qwen2

|{{dts|2024-06}}

|Alibaba Cloud

|72{{cite web|title= Qwen2|website= GitHub|url= https://github.com/QwenLM/Qwen2?spm=a3c0i.28768018.7084722650.1.5cd35c10NEqBXm&file=Qwen1.5|access-date= 2024-06-17|archive-date= 2024-06-17|archive-url= https://web.archive.org/web/20240617072401/https://github.com/QwenLM/Qwen2?spm=a3c0i.28768018.7084722650.1.5cd35c10NEqBXm&file=Qwen1.5|url-status= live}}

|3T Tokens

| {{Unknown}}

|{{partial success|Qwen License}}

|Multiple sizes, the smallest being 0.5B.

|-
|DeepSeek-V2

|{{DTS|2024-06}}

|DeepSeek

|236

|8.1T tokens

|{{sort|28000|28,000}}

|{{partial success|DeepSeek License}}

|1.4M hours on H800.{{Citation |last1=DeepSeek-AI |title=DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |date=2024-06-19 |arxiv=2405.04434 |last2=Liu |first2=Aixin |last3=Feng |first3=Bei |last4=Wang |first4=Bin |last5=Wang |first5=Bingxuan |last6=Liu |first6=Bo |last7=Zhao |first7=Chenggang |last8=Dengr |first8=Chengqi |last9=Ruan |first9=Chong}}

|-
|Nemotron-4

|{{dts|2024-06}}

|Nvidia

|{{sort|340|340}}

|9T Tokens

| {{sort|200000|200,000}}

|{{Yes|NVIDIA Open Model License}}

|Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.{{Cite web |date=2024-06-14 |title=nvidia/Nemotron-4-340B-Base · Hugging Face |url=https://huggingface.co/nvidia/Nemotron-4-340B-Base |access-date=2024-06-15 |website=huggingface.co |archive-date=2024-06-15 |archive-url=https://web.archive.org/web/20240615010323/https://huggingface.co/nvidia/Nemotron-4-340B-Base |url-status=live }}{{Cite web |title=Nemotron-4 340B {{!}} Research |url=https://research.nvidia.com/publication/2024-06_nemotron-4-340b |access-date=2024-06-15 |website=research.nvidia.com |archive-date=2024-06-15 |archive-url=https://web.archive.org/web/20240615010323/https://research.nvidia.com/publication/2024-06_nemotron-4-340b |url-status=live }}

|-
|Llama 3.1

|{{dts|2024-07}}

|Meta AI

|405

|15.6T tokens

|{{sort|440000|440,000}}

| {{partial success|Llama 3 license}}

|405B version took 31 million hours on H100-80GB, at 3.8E25 FLOPs.[https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ "The Llama 3 Herd of Models" (July 23, 2024) Llama Team, AI @ Meta]{{Cite web |title=llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models |url=https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md |access-date=2024-07-23 |website=GitHub |language=en |archive-date=2024-07-23 |archive-url=https://web.archive.org/web/20240723151851/https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md |url-status=live }}

|-
|DeepSeek-V3

|{{dts|2024-12}}

|DeepSeek

|671

|14.8T tokens

|{{sort|56000|56,000}}

|{{Yes|MIT}}

|2.788M hours on H800 GPUs.{{Citation |title=deepseek-ai/DeepSeek-V3 |date=2024-12-26 |url=https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file |access-date=2024-12-26 |publisher=DeepSeek}} Originally released under the DeepSeek License, then re-released under the MIT License as "DeepSeek-V3-0324" in March 2025.{{cite web |last1=Feng |first1=Coco |title=DeepSeek wows coders with more powerful open-source V3 model |url=https://www.scmp.com/tech/big-tech/article/3303798/deepseeks-upgraded-foundational-model-excels-coding-and-maths |website=South China Morning Post |access-date=6 April 2025 |language=en |date=25 March 2025}}

|-
|Amazon Nova

|{{dts|2024-12}}

|Amazon

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

|Includes three models, Nova Micro, Nova Lite, and Nova Pro{{Citation |title=Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards3 |date=2024-12-27 |url=https://docs.aws.amazon.com/ai/responsible-ai/nova-micro-lite-pro/overview.html |access-date=2024-12-27 |publisher=Amazon}}

|-
|DeepSeek-R1

|{{dts|2025-01}}

|DeepSeek

|671

|{{n/a|Not applicable}}

| {{Unknown}}

|{{Yes|MIT}}

|No pretraining. Reinforcement-learned upon V3-Base.{{Citation |title=deepseek-ai/DeepSeek-R1 |date=2025-01-21 |url=https://github.com/deepseek-ai/DeepSeek-R1 |access-date=2025-01-21 |publisher=DeepSeek}}{{Citation |last1=DeepSeek-AI |title=DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning |date=2025-01-22 |arxiv=2501.12948 |last2=Guo |first2=Daya |last3=Yang |first3=Dejian |last4=Zhang |first4=Haowei |last5=Song |first5=Junxiao |last6=Zhang |first6=Ruoyu |last7=Xu |first7=Runxin |last8=Zhu |first8=Qihao |last9=Ma |first9=Shirong}}

|-
|Qwen2.5

|{{dts|2025-01}}

|Alibaba

|72

|18T tokens

| {{Unknown}}

|{{partial success|Qwen License}}

|7 dense models, with parameter count from 0.5B to 72B. They also released 2 MoE variants.{{Citation |last1=Qwen |title=Qwen2.5 Technical Report |date=2025-01-03 |arxiv=2412.15115 |last2=Yang |first2=An |last3=Yang |first3=Baosong |last4=Zhang |first4=Beichen |last5=Hui |first5=Binyuan |last6=Zheng |first6=Bo |last7=Yu |first7=Bowen |last8=Li |first8=Chengyuan |last9=Liu |first9=Dayiheng}}

|-
|MiniMax-Text-01

|{{dts|2025-01}}

|Minimax

|456

|4.7T tokens

| {{Unknown}}

|{{partial success|Minimax Model license}}

|{{Citation |title=MiniMax-AI/MiniMax-01 |date=2025-01-26 |url=https://github.com/MiniMax-AI/MiniMax-01?tab=readme-ov-file |access-date=2025-01-26 |publisher=MiniMax}}{{Citation |last1=MiniMax |title=MiniMax-01: Scaling Foundation Models with Lightning Attention |date=2025-01-14 |arxiv=2501.08313 |last2=Li |first2=Aonian |last3=Gong |first3=Bangwei |last4=Yang |first4=Bo |last5=Shan |first5=Boji |last6=Liu |first6=Chang |last7=Zhu |first7=Cheng |last8=Zhang |first8=Chunhao |last9=Guo |first9=Congchao}}

|-
|Gemini 2.0

|{{dts|2025-02}}

|Google DeepMind

| {{Unknown}}

| {{Unknown}}

| {{Unknown}}

|{{proprietary}}

| Three models released: Flash, Flash-Lite and Pro{{cite web |last1=Kavukcuoglu |first1=Koray |title=Gemini 2.0 is now available to everyone |url=https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/ |website=Google |date=5 February 2025 |access-date=6 February 2025}}{{cite web |title=Gemini 2.0: Flash, Flash-Lite and Pro |url=https://developers.googleblog.com/en/gemini-2-family-expands/ |website=Google for Developers |access-date=6 February 2025}}{{cite news |last1=Franzen |first1=Carl |title=Google launches Gemini 2.0 Pro, Flash-Lite and connects reasoning model Flash Thinking to YouTube, Maps and Search |url=https://venturebeat.com/ai/google-launches-gemini-2-0-pro-flash-lite-and-connects-reasoning-model-flash-thinking-to-youtube-maps-and-search/ |access-date=6 February 2025 |work=VentureBeat |date=5 February 2025}}

|-
|Mistral Large

|{{dts|2024-11}}

|Mistral AI

|123

| {{Unknown}}

| {{Unknown}}

|{{partial success|Mistral Research License}}

|Upgraded over time. The latest version is 24.11.{{cite web |url=https://docs.mistral.ai/getting-started/models/models_overview/ |title=Models Overview |website=mistral.ai |access-date=2025-03-03}}

|-
|Pixtral

|{{dts|2024-11}}

|Mistral AI

|123

| {{Unknown}}

| {{Unknown}}

|{{partial success|Mistral Research License}}

|Multimodal. There is also a 12B version, which is under the Apache 2.0 license.

|-
|Grok 3

|{{dts|2025-02}}

|xAI

| {{Unknown}}

| {{Unknown}}

| {{Unknown}},
estimated 5,800,000.

|{{proprietary}}

|Training cost claimed "10x the compute of previous state-of-the-art models".{{Cite web |title=Grok 3 Beta — The Age of Reasoning Agents |url=https://x.ai/blog/grok-3 |access-date=2025-02-22 |website=x.ai |language=en}}

|-
|Llama 4

|{{dts|2025-04-05}}

|Meta AI

|{{sort|400|400}}

|{{sort|40000000000000|40T tokens}}

|

|{{partial success|Llama 4 license}}

|{{Cite web |date=2025-04-05 |title=meta-llama/Llama-4-Maverick-17B-128E · Hugging Face |url=https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E |access-date=2025-04-06 |website=huggingface.co}}{{Cite web |title=The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation |url=https://ai.meta.com/blog/llama-4-multimodal-intelligence/ |archive-url=http://web.archive.org/web/20250405185132/https://ai.meta.com/blog/llama-4-multimodal-intelligence/ |archive-date=2025-04-05 |access-date=2025-04-05 |website=ai.meta.com |language=en}}

|-
|Qwen3

|{{dts|2025-04}}

|Alibaba Cloud

|235

|{{sort|36000000000000|36T tokens}}

| {{Unknown}}

|{{yes|Apache 2.0}}

|Multiple sizes, the smallest being 0.6B.{{Cite web |last=Team |first=Qwen |date=2025-04-29 |title=Qwen3: Think Deeper, Act Faster |url=https://qwenlm.github.io/blog/qwen3/ |access-date=2025-04-29 |website=Qwen |language=en}}

|}

== See also ==

== Notes ==
{{notelist}}

== References ==

{{reflist}}

{{Natural Language Processing}}

{{Portal bar|Language}}

{{Authority control}}

[[Category:Software comparisons]]
[[Category:Large language models]]