{{Short description|Language model benchmark}}
'''Humanity's Last Exam''' ('''HLE''') is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the [[Center for AI Safety]] and [[Scale AI]].
== Creation ==
Stanford HAI's ''AI Index 2025 Annual Report'' cites Humanity's Last Exam as one of the "more challenging benchmarks" developed in response to popular AI benchmarks having reached "saturation".<ref>{{cite report | last1 = Maslej | first1 = Nestor | last2 = Fattorini | first2 = Loredana | last3 = Perrault | first3 = Raymond | last4 = Gil | first4 = Yolanda | last5 = Parli | first5 = Vanessa | last6 = Kariuki | first6 = Njenga | last7 = Capstick | first7 = Emily | last8 = Reuel | first8 = Anka | last9 = Brynjolfsson | first9 = Erik | last10 = Etchemendy | first10 = John | last11 = Ligett | first11 = Katrina | last12 = Lyons | first12 = Terah | last13 = Manyika | first13 = James | last14 = Niebles | first14 = Juan Carlos | last15 = Shoham | first15 = Yoav | last16 = Wald | first16 = Russell | last17 = Walsh | first17 = Tobi | last18 = Hamrah | first18 = Armin | last19 = Santarlasci | first19 = Lapo | last20 = Lotufo | first20 = Julia Betts | last21 = Rome | first21 = Alexandra | last22 = Shi | first22 = Andrew | last23 = Oak | first23 = Sukrut | date = April 2025 | title = The AI Index 2025 Annual Report | url = https://hai-production.s3.amazonaws.com/files/hai_ai_index_report_2025.pdf | pages = 141–142| publisher = Institute for Human-Centered AI |display-authors=1}}</ref> The test has been described as the brainchild of Dan Hendrycks, a machine learning researcher and the director of the Center for AI Safety. Hendrycks stated that he was inspired to create the test after a conversation with Elon Musk, who thought that existing language model benchmarks, such as the MMLU, were too easy. Hendrycks worked with Scale AI to compile the questions.<ref>{{cite web |last=Roose |first=Kevin |date=23 January 2025|title=When A.I. Passes This Test, Look Out |url=https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html |access-date=24 January 2025|website=New York Times |language=en-US |archive-url=https://archive.today/20250129113131/https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html |archive-date=29 January 2025}}</ref>

The questions were crowdsourced from subject-matter experts at institutions across the world.<ref>{{cite web |last1=Dastin |first1=Jeffrey |last2=Paul |first2=Katie |date=16 September 2024|title=AI experts ready 'Humanity's Last Exam' to stump powerful tech |url=https://www.reuters.com/technology/artificial-intelligence/ai-experts-ready-humanitys-last-exam-stump-powerful-tech-2024-09-16/ |website=Reuters |access-date=24 January 2025 |archive-url=https://archive.today/20250408215942/https://www.reuters.com/technology/artificial-intelligence/ai-experts-ready-humanitys-last-exam-stump-powerful-tech-2024-09-16/ |archive-date=8 April 2025}}</ref> Candidate questions were first screened against leading AI models: a question advanced only if the models failed to answer it or, for multiple-choice questions, did worse than random guessing. Questions that survived this screening were then reviewed by human experts in two rounds and approved for inclusion in the dataset. The submitters of the top-rated questions shared prize money from a pool of US$500,000: $5,000 for each of the top 50 questions and $500 for each of the next 500. After the initial release, a "community feedback bug bounty program" was opened to "identify and remove major errors in the dataset".
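The pre-screening rule described above can be illustrated with a minimal sketch. The data layout and helper names below are hypothetical, not taken from the benchmark's actual pipeline:

<syntaxhighlight lang="python">
# Minimal sketch of the model pre-screening step described above.
# `model_results` maps each frontier model's name to its graded attempts on
# one candidate question; all field names here are illustrative assumptions.

def worse_than_chance(n_correct: int, n_attempts: int, n_choices: int) -> bool:
    """True if accuracy over repeated attempts falls below random guessing."""
    return (n_correct / n_attempts) < (1.0 / n_choices)

def passes_prescreen(question: dict, model_results: dict) -> bool:
    if question["answer_type"] == "multiple_choice":
        # Multiple-choice: keep only if every model does worse than chance.
        n_choices = len(question["choices"])
        return all(
            worse_than_chance(r["n_correct"], r["n_attempts"], n_choices)
            for r in model_results.values()
        )
    # Exact-match: keep only if every model fails to produce the answer.
    return all(r["n_correct"] == 0 for r in model_results.values())
</syntaxhighlight>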
== Composition ==
The benchmark consists of 2,500 questions in the publicly released set. The accompanying paper classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require understanding of both text and images, i.e., they are multi-modal. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions. A private set is also maintained to test for benchmark overfitting.<ref>{{cite arXiv|eprint=2501.14249 |last1=Phan |first1=Long |last2=Gatti |first2=Alice |last3=Han |first3=Ziwen |last4=Li |first4=Nathaniel |last5=Hu |first5=Josephina |last6=Zhang |first6=Hugh |author7=Chen Bo Calvin Zhang |last8=Shaaban |first8=Mohamed |last9=Ling |first9=John |last10=Shi |first10=Sean |last11=Choi |first11=Michael |last12=Agrawal |first12=Anish |last13=Chopra |first13=Arnav |last14=Khoja |first14=Adam |last15=Kim |first15=Ryan |last16=Ren |first16=Richard |last17=Hausenloy |first17=Jason |last18=Zhang |first18=Oliver |last19=Mazeika |first19=Mantas |last20=Nguyen |first20=Tung |last21=Anderson |first21=Daron |author22=Imad Ali Shah |last23=Doroshenko |first23=Mikhail |author24=Alun Cennyth Stokes |last25=Mahmood |first25=Mobeen |last26=Lee |first26=Jaeho |last27=Pokutnyi |first27=Oleksandr |last28=Iskra |first28=Oleg |last29=Wang |first29=Jessica P. |last30=Gerbicz |first30=Robert |title=Humanity's Last Exam |date=2025 |class=cs.LG |display-authors=1 }}</ref>
An example question from the benchmark:
{{Blockquote|Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.}}
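For readers who want to reproduce the composition figures above programmatically, a minimal sketch follows. It assumes the public set is distributed on Hugging Face under the identifier <code>cais/hle</code> with fields named <code>category</code> and <code>answer_type</code>; these identifiers are assumptions, not confirmed details of the release:

<syntaxhighlight lang="python">
# Sketch: tally the composition figures above from the public set.
# Dataset id "cais/hle" and the field names are assumptions, not verified.
from collections import Counter
from datasets import load_dataset  # pip install datasets

ds = load_dataset("cais/hle", split="test")
print(Counter(ds["category"]))     # subject breakdown (mathematics, ...)
print(Counter(ds["answer_type"]))  # multiple-choice vs. exact-match share
</syntaxhighlight>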
== Results ==
class="wikitable sortable plainrowheaders"
|+ Performance of various models on the benchmark | |||
scope="col" | Organization
! scope="col" class="unsortable" | Model ! scope="col" | Accuracy (%) ↑ ! scope="col" |Calibration Error (%) ↓ | |||
---|---|---|---|
OpenAI | o3 (high) | 20.32 | 34 |
Google DeepMind | Gemini 2.5 Pro Experimental | 18.16 | 71 |
Anthropic | Claude 3.7 Sonnet (16K) | 8.04 | 80 |
Meta AI | Llama 4 Maverick | 5.68 | 83 |
Mistral AI | Mistral Medium 3 | 4.52 | 77 |
Amazon Web Services | Nova Pro | 4.40 | 80 |
+ style="text-align:left; font-style:italic;" | Source: [https://scale.com/leaderboard/humanitys_last_exam Scale AI]. 15 May 2025. |
class="wikitable sortable plainrowheaders"
|+ Performance of various non-multimodal models on the text-only subset of the benchmark | |||
scope="col" | Organization
! scope="col" class="unsortable" | Model ! scope="col" | Accuracy (%) ↑ ! scope="col" |Calibration Error (%) ↓ | |||
---|---|---|---|
OpenAI | o3-mini (high) | 13.37 | 80 |
Alibaba Cloud | Qwen3-235B-A22B | 11.75 | 74 |
DeepSeek | DeepSeek-R1 | 8.54 | 73 |
Amazon Web Services | Nova Micro | 4.41 | 84 |
+ style="text-align:left; font-style:italic;" | Source: [https://scale.com/leaderboard/humanitys_last_exam_text_only Scale AI]. 1 May 2025. |
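In both tables, calibration error measures the gap between a model's stated confidence and its actual accuracy: a well-calibrated model should be correct about X% of the time when it reports X% confidence, so lower values are better. A minimal sketch of a binned root-mean-square calibration error follows; the bin count and weighting scheme are illustrative assumptions, not the leaderboard's exact recipe:

<syntaxhighlight lang="python">
# Sketch: binned RMS calibration error from per-question confidences.
# `confs` holds model-stated confidences in [0, 1]; `correct` holds 0/1
# grades. Bin count and RMS weighting are assumptions, not the exact recipe.
import numpy as np

def calibration_error(confs, correct, n_bins: int = 10) -> float:
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confs * n_bins).astype(int), n_bins - 1)
    total, err_sq = len(confs), 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between mean stated confidence and mean accuracy in the bin.
            gap = confs[mask].mean() - correct[mask].mean()
            err_sq += (mask.sum() / total) * gap ** 2
    return float(np.sqrt(err_sq))  # 0 = perfectly calibrated

# A model that always claims 80% confidence but is right half the time:
print(calibration_error([0.8] * 10, [1, 0] * 5))  # ≈ 0.30
</syntaxhighlight>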
== References ==
{{Reflist}}
== External links ==
* [https://agi.safe.ai Humanity's Last Exam] at the Center for AI Safety.
* [https://scale.com/leaderboard/humanitys_last_exam Humanity's Last Exam] at Scale AI.