AI alignment

{{Short description|AI conformance to the intended objective}}

{{use mdy dates|date=September 2021}}

{{Use American English|date=February 2021}}

{{Artificial intelligence}}

In the field of artificial intelligence (AI), alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.{{Cite book | last1=Russell | first1 = Stuart J.| last2=Norvig | first2 = Peter| year = 2021| url=https://www.pearson.com/us/higher-education/program/Russell-Artificial-Intelligence-A-Modern-Approach-4th-Edition/PGM1263338.html | title=Artificial intelligence: A modern approach| publisher=Pearson |isbn=9780134610993 |edition=4th | pages=5, 1003 | access-date=September 12, 2022 }}

It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.{{Cite journal |last1=Ngo |first1=Richard |last2=Chan |first2=Lawrence |last3=Mindermann |first3=Sören |date=2022 |title=The Alignment Problem from a Deep Learning Perspective |journal=International Conference on Learning Representations|arxiv=2209.00626 }} AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their assigned final goals.{{cite arXiv |eprint=2206.13353 |class=cs.CY |first=Joseph |last=Carlsmith |title=Is Power-Seeking AI an Existential Risk? |date=2022-06-16}} Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and data distributions.{{Cite book |last=Christian |first=Brian |url=https://wwnorton.co.uk/books/9780393635829-the-alignment-problem |title=The alignment problem: Machine learning and human values |publisher=W. W. Norton & Company |year=2020 |isbn=978-0-393-86833-3 |oclc=1233266753 |access-date=September 12, 2022 |archive-url=https://web.archive.org/web/20230210114137/https://wwnorton.co.uk/books/9780393635829-the-alignment-problem |archive-date=February 10, 2023 |url-status=live}}{{Cite conference |last1=Langosco |first1=Lauro Langosco Di |last2=Koch |first2=Jack |last3=Sharkey |first3=Lee D. |last4=Pfau |first4=Jacob |last5=Krueger |first5=David |date=2022-06-28 |title=Goal Misgeneralization in Deep Reinforcement Learning |url=https://proceedings.mlr.press/v162/langosco22a.html |conference=International Conference on Machine Learning |publisher=PMLR |pages=12004–12019 |book-title=Proceedings of the 39th International Conference on Machine Learning |accessdate=2023-03-11}} Empirical research showed in 2024 that advanced large language models (LLMs) such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or prevent them from being changed.{{Cite magazine |last=Pillay |first=Tharin |date=2024-12-15 |title=New Tests Reveal AI's Capacity for Deception |url=https://time.com/7202312/new-tests-reveal-ai-capacity-for-deception/ |access-date=2025-01-12 |magazine=TIME |language=en}}{{Cite magazine |last=Perrigo |first=Billy |date=2024-12-18 |title=Exclusive: New Research Shows AI Strategically Lying |url=https://time.com/7202784/ai-research-strategic-lying/ |access-date=2025-01-12 |magazine=TIME |language=en}}

Today, some of these issues affect existing commercial systems such as LLMs,{{cite arXiv |eprint=2203.02155 |class=cs.CL |first1=Long |last1=Ouyang |first2=Jeff |last2=Wu |title=Training language models to follow instructions with human feedback |year=2022 |last3=Jiang |first3=Xu |last4=Almeida |first4=Diogo |last5=Wainwright |first5=Carroll L. |last6=Mishkin |first6=Pamela |last7=Zhang |first7=Chong |last8=Agarwal |first8=Sandhini |last9=Slama |first9=Katarina |last10=Ray |first10=Alex |last11=Schulman |first11=J. |last12=Hilton |first12=Jacob |last13=Kelton |first13=Fraser |last14=Miller |first14=Luke E. |last15=Simens |first15=Maddie |last16=Askell |first16=Amanda |last17=Welinder |first17=P. |last18=Christiano |first18=P. |last19=Leike |first19=J. |last20=Lowe |first20=Ryan J.}}{{Cite web |last1=Zaremba |first1=Wojciech |last2=Brockman |first2=Greg |last3=OpenAI |date=2021-08-10 |title=OpenAI Codex |url=https://openai.com/blog/openai-codex/ |url-status=live |archive-url=https://web.archive.org/web/20230203201912/https://openai.com/blog/openai-codex/ |archive-date=February 3, 2023 |accessdate=2022-07-23 |work=OpenAI}} robots,{{Cite journal |last1=Kober |first1=Jens |last2=Bagnell |first2=J. Andrew |last3=Peters |first3=Jan |date=2013-09-01 |title=Reinforcement learning in robotics: A survey |url=http://journals.sagepub.com/doi/10.1177/0278364913495721 |url-status=live |journal=The International Journal of Robotics Research |language=en |volume=32 |issue=11 |pages=1238–1274 |doi=10.1177/0278364913495721 |issn=0278-3649 |s2cid=1932843 |archive-url=https://web.archive.org/web/20221015200445/https://journals.sagepub.com/doi/10.1177/0278364913495721 |archive-date=October 15, 2022 |access-date=September 12, 2022}} autonomous vehicles,{{Cite journal |last1=Knox |first1=W. Bradley |last2=Allievi |first2=Alessandro |last3=Banzhaf |first3=Holger |last4=Schmitt |first4=Felix |last5=Stone |first5=Peter |date=2023-03-01 |title=Reward (Mis)design for autonomous driving |journal=Artificial Intelligence |language=en |volume=316 |pages=103829 |doi=10.1016/j.artint.2022.103829 |s2cid=233423198 |issn=0004-3702|doi-access=free |arxiv=2104.13906 }} and social media recommendation engines.{{Cite journal |last1=Bommasani |first1=Rishi |last2=Hudson |first2=Drew A. |last3=Adeli |first3=Ehsan |last4=Altman |first4=Russ |last5=Arora |first5=Simran |last6=von Arx |first6=Sydney |last7=Bernstein |first7=Michael S. |last8=Bohg |first8=Jeannette |last9=Bosselut |first9=Antoine |last10=Brunskill |first10=Emma |last11=Brynjolfsson |first11=Erik |date=2022-07-12 |title=On the Opportunities and Risks of Foundation Models |url=https://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models |journal=Stanford CRFM |arxiv=2108.07258}}{{Cite book |last=Russell |first=Stuart J. |url=https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/ |title=Human compatible: Artificial intelligence and the problem of control |publisher=Penguin Random House |year=2020 |isbn=9780525558637 |oclc=1113410915}}{{Cite journal |last=Stray |first=Jonathan |year=2020 |title=Aligning AI Optimization to Community Well-Being |journal=International Journal of Community Well-Being |language=en |volume=3 |issue=4 |pages=443–463 |doi=10.1007/s42413-020-00086-3 |issn=2524-5295 |pmc=7610010 |pmid=34723107 |s2cid=226254676}} Some AI researchers argue that more capable future systems will be more severely affected because these problems partially result from high capabilities.{{Cite book

|last1=Russell |first1=Stuart |last2=Norvig |first2=Peter |url=https://aima.cs.berkeley.edu/ |title=Artificial Intelligence: A Modern Approach |date=2009 |publisher=Prentice Hall |isbn=978-0-13-461099-3 |pages=1003}}{{Cite conference |last1=Pan |first1=Alexander |last2=Bhatia |first2=Kush |last3=Steinhardt |first3=Jacob |date=2022-02-14 |title=The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models |url=https://openreview.net/forum?id=JYtwGwIL7ye |conference=International Conference on Learning Representations |accessdate=2022-07-21}}

Many prominent AI researchers and the leadership of major AI companies have argued or asserted that AI is approaching human-level (artificial general intelligence, AGI) and superhuman (artificial superintelligence, ASI) cognitive capabilities, and that such systems could endanger human civilization if misaligned.{{Cite web |last=Smith |first=Craig S. |title=Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat' |url=https://www.forbes.com/sites/craigsmith/2023/05/04/geoff-hinton-ais-most-famous-researcher-warns-of-existential-threat/ |access-date=2023-05-04 |website=Forbes |language=en}} These include "AI godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind.{{cite journal |last1=Bengio |first1=Yoshua |last2=Hinton |first2=Geoffrey |last3=Yao |first3=Andrew |last4=Song |first4=Dawn |last5=Abbeel |first5=Pieter |last6=Harari |first6=Yuval Noah |last7=Zhang |first7=Ya-Qin |last8=Xue |first8=Lan |last9=Shalev-Shwartz |first9=Shai |date=2024 |title=Managing extreme AI risks amid rapid progress |journal=Science |volume=384 |issue=6698 |pages=842–845 |arxiv=2310.17688 |bibcode=2024Sci...384..842B |doi=10.1126/science.adn0117 |pmid=38768279}}{{Cite web |title=Statement on AI Risk {{!}} CAIS |url=https://www.safe.ai/statement-on-ai-risk |access-date=2024-02-11 |website=www.safe.ai |language=en}}{{cite arXiv |eprint=2401.02843 |class=cs.CY |first1=Katja |last1=Grace |first2=Harlan |last2=Stewart |title=Thousands of AI Authors on the Future of AI |date=2024-01-05 |last3=Sandkühler |first3=Julia Fabienne |last4=Thomas |first4=Stephen |last5=Weinstein-Raun |first5=Ben |last6=Brauner |first6=Jan}} These risks remain debated.{{Cite magazine |last=Perrigo |first=Billy |date=2024-02-13 |title=Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk |url=https://time.com/6694432/yann-lecun-meta-ai-interview/ |access-date=2024-06-26 |magazine=TIME |language=en}}

AI alignment is a subfield of AI safety, the study of how to build safe AI systems. Other subfields of AI safety include robustness, monitoring, and capability control. Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.{{Cite web |last1=Ortega |first1=Pedro A. |last2=Maini |first2=Vishal |last3=DeepMind safety team |date=2018-09-27 |title=Building safe artificial intelligence: specification, robustness, and assurance |url=https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f75058f1 |url-status=live |archive-url=https://web.archive.org/web/20230210114142/https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f75058f1 |archive-date=February 10, 2023 |accessdate=2022-07-18 |work=DeepMind Safety Research – Medium}} Alignment research has connections to interpretability research,{{Cite web |last=Rorvig |first=Mordechai |date=2022-04-14 |title=Researchers Gain New Understanding From Simple AI |url=https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-processing-20220414/ |url-status=live |archive-url=https://web.archive.org/web/20230210114056/https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-processing-20220414/ |archive-date=February 10, 2023 |accessdate=2022-07-18 |work=Quanta Magazine}}{{Cite arXiv |eprint=1702.08608 |class=stat.ML |first1=Finale |last1=Doshi-Velez |first2=Been |last2=Kim |title=Towards A Rigorous Science of Interpretable Machine Learning |date=2017-03-02}}
{{Cite podcast |number=107 |last=Wiblin |first=Robert |title=Chris Olah on what the hell is going on inside neural networks |series=80,000 hours |accessdate=2022-07-23 |date=August 4, 2021 |url=https://80000hours.org/podcast/episodes/chris-olah-interpretability-research/}} (adversarial) robustness,{{Cite arXiv |eprint=1606.06565 |class=cs.AI |first1=Dario |last1=Amodei |first2=Chris |last2=Olah |title=Concrete Problems in AI Safety |date=2016-06-21 |language=en |last3=Steinhardt |first3=Jacob |last4=Christiano |first4=Paul |last5=Schulman |first5=John |last6=Mané |first6=Dan}} anomaly detection, calibrated uncertainty, formal verification,{{Cite journal |last1=Russell |first1=Stuart |last2=Dewey |first2=Daniel |last3=Tegmark |first3=Max |date=2015-12-31 |title=Research Priorities for Robust and Beneficial Artificial Intelligence |url=https://ojs.aaai.org/index.php/aimagazine/article/view/2577 |url-status=live |journal=AI Magazine |volume=36 |issue=4 |pages=105–114 |doi=10.1609/aimag.v36i4.2577 |issn=2371-9621 |s2cid=8174496 |archive-url=https://web.archive.org/web/20230202181059/https://ojs.aaai.org/index.php/aimagazine/article/view/2577 |archive-date=February 2, 2023 |access-date=September 12, 2022 |hdl=1721.1/108478|doi-access=free |arxiv=1602.03506 }} preference learning,{{Cite journal |last1=Wirth |first1=Christian |last2=Akrour |first2=Riad |last3=Neumann |first3=Gerhard |last4=Fürnkranz |first4=Johannes |year=2017 |title=A survey of preference-based reinforcement learning methods |journal=Journal of Machine Learning Research |volume=18 |issue=136 |pages=1–46}}{{Cite conference |last1=Christiano |first1=Paul F. |last2=Leike |first2=Jan |last3=Brown |first3=Tom B. |last4=Martic |first4=Miljan |last5=Legg |first5=Shane |last6=Amodei |first6=Dario |year=2017 |title=Deep reinforcement learning from human preferences |series=NIPS'17 |location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=4302–4310 |isbn=978-1-5108-6096-4 |book-title=Proceedings of the 31st International Conference on Neural Information Processing Systems}}{{Cite web |last=Heaven |first=Will Douglas |date=2022-01-27 |title=The new version of GPT-3 is much better behaved (and should be less toxic) |url=https://www.technologyreview.com/2022/01/27/1044398/new-gpt3-openai-chatbot-language-model-ai-toxic-misinformation/ |url-status=live |archive-url=https://web.archive.org/web/20230210114056/https://www.technologyreview.com/2022/01/27/1044398/new-gpt3-openai-chatbot-language-model-ai-toxic-misinformation/ |archive-date=February 10, 2023 |accessdate=2022-07-18 |work=MIT Technology Review}} safety-critical engineering,{{cite arXiv |eprint=2106.04823 |class=cs.LG |first1=Sina |last1=Mohseni |first2=Haotao |last2=Wang |title=Taxonomy of Machine Learning Safety: A Survey and Primer |date=2022-03-07 |last3=Yu |first3=Zhiding |last4=Xiao |first4=Chaowei |last5=Wang |first5=Zhangyang |last6=Yadawa |first6=Jay}} game theory,{{Cite web |last=Clifton |first=Jesse |year=2020 |title=Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda |url=https://longtermrisk.org/research-agenda/ |url-status=live |archive-url=https://web.archive.org/web/20230101041759/https://longtermrisk.org/research-agenda |archive-date=January 1, 2023 |accessdate=2022-07-18 |work=Center on Long-Term Risk}}
{{Cite journal |last1=Dafoe |first1=Allan |last2=Bachrach |first2=Yoram |last3=Hadfield |first3=Gillian |last4=Horvitz |first4=Eric |last5=Larson |first5=Kate |last6=Graepel |first6=Thore |date=2021-05-06 |title=Cooperative AI: machines must learn to find common ground |url=http://www.nature.com/articles/d41586-021-01170-0 |url-status=live |journal=Nature |language=en |volume=593 |issue=7857 |pages=33–36 |bibcode=2021Natur.593...33D |doi=10.1038/d41586-021-01170-0 |issn=0028-0836 |pmid=33947992 |s2cid=233740521 |archive-url=https://web.archive.org/web/20221218210857/https://www.nature.com/articles/d41586-021-01170-0 |archive-date=December 18, 2022 |access-date=September 12, 2022}} algorithmic fairness,{{Cite book |last1=Prunkl |first1=Carina |last2=Whittlestone |first2=Jess |title=Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society |chapter=Beyond Near- and Long-Term |date=2020-02-07 |chapter-url=https://dl.acm.org/doi/10.1145/3375627.3375803 |url-status=live |language=en |location=New York NY USA |publisher=ACM |pages=138–143 |doi=10.1145/3375627.3375803 |isbn=978-1-4503-7110-0 |s2cid=210164673 |archive-url=https://web.archive.org/web/20221016123733/https://dl.acm.org/doi/10.1145/3375627.3375803 |archive-date=October 16, 2022 |access-date=September 12, 2022}} and social sciences.{{Cite journal |last1=Irving |first1=Geoffrey |last2=Askell |first2=Amanda |date=2019-02-19 |title=AI Safety Needs Social Scientists |url=https://distill.pub/2019/safety-needs-social-scientists |url-status=live |journal=Distill |volume=4 |issue=2 |pages=10.23915/distill.00014 |doi=10.23915/distill.00014 |issn=2476-0757 |s2cid=159180422 |archive-url=https://web.archive.org/web/20230210114220/https://distill.pub/2019/safety-needs-social-scientists/ |archive-date=February 10, 2023 |access-date=September 12, 2022|doi-access=free }}{{Cite journal |last1=Gazos |first1=Alexandros |last2=Kahn |first2=James |last3=Kusche |first3=Isabel |last4=Büscher |first4=Christian |last5=Götz |first5=Markus |date=2025-04-01 |title=Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems |journal=Safety Science |volume=184 |pages=106731 |doi=10.1016/j.ssci.2024.106731 |issn=0925-7535|doi-access=free }}

== Objectives in AI ==

{{Main|Intelligent agent#Objective function}}

Programmers provide an AI system such as AlphaZero with an "objective function",{{efn|Terminology varies based on context. Similar concepts include goal function, utility function, loss function, etc.}} in which they intend to encapsulate the goal(s) the AI is configured to accomplish. Such a system later populates a (possibly implicit) internal "model" of its environment. This model encapsulates all the agent's beliefs about the world. The AI then creates and executes whatever plan is calculated to maximize{{efn|or minimize, depending on the context}} the value{{efn|in the presence of uncertainty, the expected value}} of its objective function.Bringsjord, Selmer and Govindarajulu, Naveen Sundar, [https://plato.stanford.edu/archives/sum2020/entries/artificial-intelligence/ "Artificial Intelligence"], The Stanford Encyclopedia of Philosophy (Summer 2020 Edition), Edward N. Zalta (ed.) For example, when AlphaZero is trained on chess, it has a simple objective function of "+1 if AlphaZero wins, −1 if AlphaZero loses". During the game, AlphaZero attempts to execute whatever sequence of moves it judges most likely to attain the maximum value of +1.{{cite news |title=Why AlphaZero's Artificial Intelligence Has Trouble With the Real World |url=https://www.quantamagazine.org/why-alphazeros-artificial-intelligence-has-trouble-with-the-real-world-20180221/ |accessdate=20 June 2020 |work=Quanta Magazine |date=2018 |language=en}} Similarly, a reinforcement learning system can have a "reward function" that allows the programmers to shape the AI's desired behavior.{{cite news |last1=Wolchover |first1=Natalie |title=Artificial Intelligence Will Do What We Ask. That's a Problem. |url=https://www.quantamagazine.org/artificial-intelligence-will-do-what-we-ask-thats-a-problem-20200130/ |work=Quanta Magazine |date=30 January 2020 |language=en |accessdate=21 June 2020}} An evolutionary algorithm's behavior is shaped by a "fitness function".Bull, Larry. "On model-based evolutionary computation." Soft Computing 3, no. 2 (1999): 76–82.
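
As an illustration of how an objective function shapes an agent's behavior, the following minimal sketch (hypothetical code, not drawn from AlphaZero or any cited system) shows an agent that scores outcomes with a designer-specified win/loss objective and selects the move with the highest estimated expected value:

<syntaxhighlight lang="python">
# Illustrative sketch (not AlphaZero's actual code): an agent that plans by
# maximizing the expected value of a designer-specified objective function.
import random

def objective(outcome: str) -> float:
    """Objective ("reward") function chosen by the designers: +1 win, -1 loss."""
    return {"win": 1.0, "loss": -1.0, "draw": 0.0}[outcome]

def expected_value(state, move, n_rollouts: int = 100) -> float:
    """Estimate the expected objective value of playing `move` in `state`.
    The random outcomes stand in for the agent's internal model of the game."""
    outcomes = [random.choice(["win", "loss", "draw"]) for _ in range(n_rollouts)]
    return sum(objective(o) for o in outcomes) / n_rollouts

def choose_move(state, legal_moves):
    """Plan: pick whichever move has the highest estimated expected value."""
    return max(legal_moves, key=lambda m: expected_value(state, m))

print(choose_move("start position", ["e4", "d4", "c4"]))
</syntaxhighlight>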

== Alignment problem ==

{{Redirect|Alignment problem|the book|The Alignment Problem}}

In 1960, AI pioneer Norbert Wiener described the AI alignment problem as follows:

{{Blockquote|If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively ... we had better be quite sure that the purpose put into the machine is the purpose which we really desire.{{Cite journal |last=Wiener |first=Norbert |date=1960-05-06 |title=Some Moral and Technical Consequences of Automation: As machines learn they may develop unforeseen strategies at rates that baffle their programmers. |url=https://www.science.org/doi/10.1126/science.131.3410.1355 |url-status=live |journal=Science |language=en |volume=131 |issue=3410 |pages=1355–1358 |doi=10.1126/science.131.3410.1355 |issn=0036-8075 |pmid=17841602 |s2cid=30855376 |archive-url=https://web.archive.org/web/20221015105034/https://www.science.org/doi/10.1126/science.131.3410.1355 |archive-date=October 15, 2022 |access-date=September 12, 2022|url-access=subscription }}}}

AI alignment involves ensuring that an AI system's objectives match those of its designers or users, or match widely shared values, objective ethical standards, or the intentions its designers would have if they were more informed and enlightened.{{Cite journal |last=Gabriel |first=Iason |date=2020-09-01 |title=Artificial Intelligence, Values, and Alignment |journal=Minds and Machines |volume=30 |issue=3 |pages=411–437 |doi=10.1007/s11023-020-09539-2 |issn=1572-8641 |s2cid=210920551 |doi-access=free |arxiv=2001.09768 }}

AI alignment is an open problem for modern AI systems{{Cite news |last=The Ezra Klein Show |date=2021-06-04 |title=If 'All Models Are Wrong,' Why Do We Give Them So Much Power? |work=The New York Times |url=https://www.nytimes.com/2021/06/04/opinion/ezra-klein-podcast-brian-christian.html |url-status=live |accessdate=2023-03-13 |archive-url=https://web.archive.org/web/20230215224050/https://www.nytimes.com/2021/06/04/opinion/ezra-klein-podcast-brian-christian.html |archive-date=February 15, 2023 |issn=0362-4331}}
{{Cite web |last=Wolchover |first=Natalie |date=2015-04-21 |title=Concerns of an Artificial Intelligence Pioneer |url=https://www.quantamagazine.org/artificial-intelligence-aligned-with-human-values-qa-with-stuart-russell-20150421/ |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://www.quantamagazine.org/artificial-intelligence-aligned-with-human-values-qa-with-stuart-russell-20150421/ |archive-date=February 10, 2023 |accessdate=2023-03-13 |work=Quanta Magazine}}
{{Cite web |last=California Assembly |title=Bill Text – ACR-215 23 Asilomar AI Principles. |url=https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180ACR215 |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180ACR215 |archive-date=February 10, 2023 |accessdate=2022-07-18}}{{Cite news |last1=Johnson |first1=Steven |last2=Iziev |first2=Nikita |date=2022-04-15 |title=A.I. Is Mastering Language. Should We Trust What It Says? |work=The New York Times |url=https://www.nytimes.com/2022/04/15/magazine/ai-language.html |url-status=live |accessdate=2022-07-18 |archive-url=https://web.archive.org/web/20221124151408/https://www.nytimes.com/2022/04/15/magazine/ai-language.html |archive-date=November 24, 2022 |issn=0362-4331}} and is a research field within AI.{{Cite web |last=OpenAI |title=Developing safe & responsible AI |url=https://openai.com/blog/our-approach-to-alignment-research |accessdate=2023-03-13}}
{{Cite web |title=DeepMind Safety Research |url=https://deepmindsafetyresearch.medium.com |url-status=live |archive-url=https://web.archive.org/web/20230210114142/https://deepmindsafetyresearch.medium.com/ |archive-date=February 10, 2023 |accessdate=2023-03-13 |work=Medium}} Aligning AI involves two main challenges: carefully specifying the purpose of the system (outer alignment) and ensuring that the system adopts the specification robustly (inner alignment).{{r|dlp2023}} Researchers also attempt to create AI models that have robust alignment, sticking to safety constraints even when users adversarially try to bypass them.

=== Specification gaming and side effects ===

{{main|Reward hacking}}

To specify an AI system's purpose, AI designers typically provide an objective function, examples, or feedback to the system. But designers are often unable to completely specify all important values and constraints, so they resort to easy-to-specify proxy goals such as maximizing the approval of human overseers, who are fallible.{{r|concrete2016|building2018}}{{Cite arXiv |last1=Hendrycks |first1=Dan |last2=Carlini |first2=Nicholas | author2-link = Nicholas Carlini |last3=Schulman |first3=John |last4=Steinhardt |first4=Jacob |date=2022-06-16 |title=Unsolved Problems in ML Safety |class=cs.LG |eprint=2109.13916 }}{{Cite book |last1=Russell |first1=Stuart J. |url=https://www.pearson.com/us/higher-education/program/Russell-Artificial-Intelligence-A-Modern-Approach-4th-Edition/PGM1263338.html |title=Artificial intelligence: a modern approach |last2=Norvig |first2=Peter |date=2022 |publisher=Pearson |isbn=978-1-292-40113-3 |edition=4th |pages=4–5 |oclc=1303900751}}{{Cite web |last1=Krakovna |first1=Victoria |last2=Uesato |first2=Jonathan |last3=Mikulik |first3=Vladimir |last4=Rahtz |first4=Matthew |last5=Everitt |first5=Tom |last6=Kumar |first6=Ramana |last7=Kenton |first7=Zac |last8=Leike |first8=Jan |last9=Legg |first9=Shane |date=2020-04-21 |title=Specification gaming: the flip side of AI ingenuity |url=https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity |url-status=live |archive-url=https://web.archive.org/web/20230210114143/https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity |archive-date=February 10, 2023 |accessdate=2022-08-26 |work=Deepmind}} As a result, AI systems can find loopholes that help them accomplish the specified objective efficiently but in unintended, possibly harmful ways. This tendency is known as specification gaming or reward hacking, and is an instance of Goodhart's law.{{r|SpecGaming2020|mmmm2022}}{{Cite arXiv |eprint=1803.04585 |class=cs.AI |first1=David |last1=Manheim |first2=Scott |last2=Garrabrant |title=Categorizing Variants of Goodhart's Law |year=2018}} As AI systems become more capable, they are often able to game their specifications more effectively.{{r|mmmm2022}}

[[File:Robot hand trained with human feedback 'pretends' to grasp ball.ogg|thumb|A robot hand trained with human feedback "pretends" to grasp a ball.]]

Specification gaming has been observed in numerous AI systems.{{r|SpecGaming2020}}{{Cite web|url=https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml|title=Specification gaming examples in AI - master list - Google Drive|website=docs.google.com}} One system was trained to finish a simulated boat race by rewarding the system for hitting targets along the track, but the system achieved more reward by looping and crashing into the same targets indefinitely.{{Cite web |last1=Clark |first1=Jack |last2=Amodei |first2=Dario |date=21 Dec 2016 |title=Faulty reward functions in the wild |url=https://openai.com/research/faulty-reward-functions |access-date=2023-12-30 |website=openai.com |language=en-US}} Similarly, a simulated robot was trained to grab a ball by rewarding the robot for getting positive feedback from humans, but it learned to place its hand between the ball and camera, making it falsely appear successful (see video).{{r|lfhp2017}} Chatbots often produce falsehoods if they are based on language models that are trained to imitate text from internet corpora, which are broad but fallible.{{Cite journal |last1=Lin |first1=Stephanie |last2=Hilton |first2=Jacob |last3=Evans |first3=Owain |year=2022 |title=TruthfulQA: Measuring How Models Mimic Human Falsehoods |url=https://aclanthology.org/2022.acl-long.229 |url-status=live |journal=Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |language=en |location=Dublin, Ireland |publisher=Association for Computational Linguistics |pages=3214–3252 |doi=10.18653/v1/2022.acl-long.229 |s2cid=237532606 |archive-url=https://web.archive.org/web/20230210114231/https://aclanthology.org/2022.acl-long.229/ |archive-date=February 10, 2023 |access-date=September 12, 2022|doi-access=free |arxiv=2109.07958 }}{{Cite news |last=Naughton |first=John |date=2021-10-02 |title=The truth about artificial intelligence? It isn't that honest |work=The Observer |url=https://www.theguardian.com/commentisfree/2021/oct/02/the-truth-about-artificial-intelligence-it-isnt-that-honest |url-status=live |accessdate=2022-07-23 |archive-url=https://web.archive.org/web/20230213231317/https://www.theguardian.com/commentisfree/2021/oct/02/the-truth-about-artificial-intelligence-it-isnt-that-honest |archive-date=February 13, 2023 |issn=0029-7712}} When they are retrained to produce text that humans rate as true or helpful, chatbots like ChatGPT can fabricate fake explanations that humans find convincing, often called "hallucinations".{{Cite journal |last1=Ji |first1=Ziwei |last2=Lee |first2=Nayeon |last3=Frieske |first3=Rita |last4=Yu |first4=Tiezheng |last5=Su |first5=Dan |last6=Xu |first6=Yan |last7=Ishii |first7=Etsuko |last8=Bang |first8=Yejin |last9=Madotto |first9=Andrea |last10=Fung |first10=Pascale |date=2022-02-01 |title=Survey of Hallucination in Natural Language Generation |journal=ACM Computing Surveys |volume=55 |issue=12 |pages=1–38 |arxiv=2202.03629 |doi=10.1145/3571730 |s2cid=246652372 }}
{{Cite journal |last=Else |first=Holly |date=2023-01-12 |title=Abstracts written by ChatGPT fool scientists |url=https://www.nature.com/articles/d41586-023-00056-7 |journal=Nature |language=en |volume=613 |issue=7944 |pages=423 |doi=10.1038/d41586-023-00056-7|pmid=36635510 |bibcode=2023Natur.613..423E |s2cid=255773668 |url-access=subscription }} Some alignment researchers aim to help humans detect specification gaming and to steer AI systems toward carefully specified objectives that are safe and useful to pursue.
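
The boat-race example can be reduced to a toy calculation (illustrative code only, not the original environment): an agent rewarded for a proxy signal, here hitting targets, earns more of the specified reward by looping indefinitely than by pursuing the intended goal of finishing the race.

<syntaxhighlight lang="python">
# Toy illustration of specification gaming: the specified proxy reward
# (points per target hit) diverges from the intended goal (finish the race).
# Purely illustrative; this is not the original boat-racing environment.

def proxy_reward(trajectory) -> int:
    """The reward the designers actually specified: +1 per target hit."""
    return sum(1 for step in trajectory if step == "hit_target")

def intended_goal(trajectory) -> int:
    """What the designers actually wanted: reaching the finish line."""
    return 1 if "finish_line" in trajectory else 0

intended = ["hit_target", "hit_target", "finish_line"]  # hoped-for behavior
gamed = ["hit_target"] * 50                             # loop forever, never finish

assert proxy_reward(gamed) > proxy_reward(intended)    # the loophole pays more reward
assert intended_goal(gamed) < intended_goal(intended)  # yet the real goal is missed
</syntaxhighlight>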

When a misaligned AI system is deployed, it can have consequential side effects. Social media platforms have been known to optimize for click-through rates, causing user addiction on a global scale.{{r|Unsolved2022}} Stanford researchers say that such recommender systems are misaligned with their users because they "optimize simple engagement metrics rather than a harder-to-measure combination of societal and consumer well-being".{{r|Opportunities_Risks}}

Explaining such side effects, Berkeley computer scientist Stuart Russell noted that the omission of implicit constraints can cause harm: "A system ... will often set ... unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer's apprentice, or King Midas: you get exactly what you ask for, not what you want."{{Cite web |last=Russell|first=Stuart|website=Edge.org |title=Of Myths and Moonshine|url=https://www.edge.org/conversation/the-myth-of-ai |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://www.edge.org/conversation/the-myth-of-ai |archive-date=February 10, 2023 |accessdate=2022-07-19}}

Some researchers suggest that AI designers specify their desired goals by listing forbidden actions or by formalizing ethical rules (as with Asimov's Three Laws of Robotics).{{Cite journal |last=Tasioulas |first=John |year=2019 |title=First Steps Towards an Ethics of Robots and Artificial Intelligence |journal=Journal of Practical Ethics |volume=7 |issue=1 |pages=61–95}} But Russell and Norvig argue that this approach overlooks the complexity of human values: "It is certainly very hard, and perhaps impossible, for mere humans to anticipate and rule out in advance all the disastrous ways the machine could choose to achieve a specified objective."

Additionally, even if an AI system fully understands human intentions, it may still disregard them, because following human intentions may not be its objective (unless it is already fully aligned).

A 2025 study by Palisade Research found that when tasked to win at chess against a stronger opponent, some reasoning LLMs attempted to hack the game system. o1-preview spontaneously attempted it in 37% of cases, while DeepSeek R1 did so in 11% of cases. Other models, like GPT-4o, Claude 3.5 Sonnet, and o3-mini, attempted to cheat only when researchers provided hints about this possibility.{{Cite magazine |last=Booth |first=Harry |date=2025-02-19 |title=When AI Thinks It Will Lose, It Sometimes Cheats |url=https://time.com/7259395/ai-chess-cheating-palisade-research/ |access-date=2025-02-23 |magazine=TIME |language=en}}

=== Pressure to deploy unsafe systems ===

Commercial organizations sometimes have incentives to take shortcuts on safety and to deploy misaligned or unsafe AI systems.{{r|Unsolved2022}} For example, social media recommender systems have been profitable despite creating unwanted addiction and polarization.{{r|Opportunities_Risks}}{{Cite news |last1=Wells |first1=Georgia |last2=Deepa Seetharaman |last3=Horwitz |first3=Jeff |date=2021-11-05 |title=Is Facebook Bad for You? It Is for About 360 Million Users, Company Surveys Suggest |work=The Wall Street Journal |url=https://www.wsj.com/articles/facebook-bad-for-you-360-million-users-say-yes-company-documents-facebook-files-11636124681 |url-status=live |accessdate=2022-07-19 |archive-url=https://web.archive.org/web/20230210114137/https://www.wsj.com/articles/facebook-bad-for-you-360-million-users-say-yes-company-documents-facebook-files-11636124681 |archive-date=February 10, 2023 |issn=0099-9660}}{{Cite report |url=https://bhr.stern.nyu.edu/polarization-report-page |title=How Social Media Intensifies U.S. Political Polarization-And What Can Be Done About It |last1=Barrett |first1=Paul M. |last2=Hendrix |first2=Justin |date=September 2021 |publisher=Center for Business and Human Rights, NYU |last3=Sims |first3=J. Grant |access-date=September 12, 2022 |archive-url=https://web.archive.org/web/20230201180005/https://bhr.stern.nyu.edu/polarization-report-page |archive-date=February 1, 2023 |url-status=live}} Competitive pressure can also lead to a race to the bottom on AI safety standards. In 2018, a self-driving car killed a pedestrian (Elaine Herzberg) after engineers disabled the emergency braking system because it was oversensitive and slowed development.{{Cite news |last=Shepardson |first=David |date=2018-05-24 |title=Uber disabled emergency braking in self-driving car: U.S. agency |work=Reuters |url=https://www.reuters.com/article/us-uber-crash-idUSKCN1IP26K |url-status=live |accessdate=2022-07-20 |archive-url=https://web.archive.org/web/20230210114137/https://www.reuters.com/article/us-uber-crash-idUSKCN1IP26K |archive-date=February 10, 2023}}

=== Risks from advanced misaligned AI ===

Some researchers focus on aligning increasingly advanced AI systems, because progress in AI development is rapid and industry and governments are actively trying to build advanced AI. As AI capabilities expand in scope, aligned systems could unlock many opportunities, but their increasing complexity may also make alignment harder, potentially posing large-scale hazards.

==== Development of advanced AI ====

Many AI companies, such as OpenAI,{{Cite web |title=The messy, secretive reality behind OpenAI's bid to save the world |url=https://www.technologyreview.com/2020/02/17/844721/ai-openai-moonshot-elon-musk-sam-altman-greg-brockman-messy-secretive-reality/ |access-date=2024-08-25 |website=MIT Technology Review |language=en}} Meta{{Cite web |last=Heath |first=Alex |date=2024-01-18 |title=Mark Zuckerberg's new goal is creating artificial general intelligence |url=https://www.theverge.com/2024/1/18/24042354/mark-zuckerberg-meta-agi-reorg-interview |access-date=2024-11-05 |website=The Verge |language=en}} and DeepMind,{{Cite web |last=Johnson |first=Dave |title=DeepMind is Google's AI research hub. Here's what it does, where it's located, and how it differs from OpenAI. |url=https://www.businessinsider.com/google-deepmind |access-date=2024-08-25 |website=Business Insider |language=en-US}} have stated their aim to develop artificial general intelligence (AGI), a hypothesized AI system that matches or outperforms humans at a broad range of cognitive tasks. Researchers who scale modern neural networks observe that they indeed develop increasingly general and unanticipated capabilities.{{r|Opportunities_Risks}}{{Cite journal |last1=Wei |first1=Jason |last2=Tay |first2=Yi |last3=Bommasani |first3=Rishi |last4=Raffel |first4=Colin |last5=Zoph |first5=Barret |last6=Borgeaud |first6=Sebastian |last7=Yogatama |first7=Dani |last8=Bosma |first8=Maarten |last9=Zhou |first9=Denny |last10=Metzler |first10=Donald |last11=Chi |first11=Ed H. |last12=Hashimoto |first12=Tatsunori |last13=Vinyals |first13=Oriol |last14=Liang |first14=Percy |last15=Dean |first15=Jeff |last16=Fedus |first16=William| date=2022-10-26 |title=Emergent Abilities of Large Language Models |journal=Transactions on Machine Learning Research | issn=2835-8856 | arxiv=2206.07682 }}{{cite arXiv | eprint=2210.14891 | last1=Caballero | first1=Ethan | last2=Gupta | first2=Kshitij | last3=Rish | first3=Irina | last4=Krueger | first4=David | title=Broken Neural Scaling Laws | date=2022 | class=cs.LG }} Such models have learned to operate a computer or write their own programs; a single "generalist" network can chat, control robots, play games, and interpret photographs.{{Cite web |last=Dominguez |first=Daniel |date=2022-05-19 |title=DeepMind Introduces Gato, a New Generalist AI Agent |url=https://www.infoq.com/news/2022/05/deepmind-gato-ai-agent/ |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://www.infoq.com/news/2022/05/deepmind-gato-ai-agent/ |archive-date=February 10, 2023 |accessdate=2022-09-09 |work=InfoQ}}
{{Cite web |last=Edwards |first=Ben |date=2022-04-26 |title=Adept's AI assistant can browse, search, and use web apps like a human |url=https://arstechnica.com/information-technology/2022/09/new-ai-assistant-can-browse-search-and-use-web-apps-like-a-human/ |url-status=live |archive-url=https://web.archive.org/web/20230117194921/https://arstechnica.com/information-technology/2022/09/new-ai-assistant-can-browse-search-and-use-web-apps-like-a-human/ |archive-date=January 17, 2023 |accessdate=2022-09-09 |work=Ars Technica}} According to surveys, some leading machine learning researchers expect AGI to be created in {{as of|2021|alt=this decade}}, while some believe it will take much longer. Many consider both scenarios possible.{{cite arXiv |last1=Grace |first1=Katja |title=Thousands of AI Authors on the Future of AI |date=2024-01-05 |eprint=2401.02843 |last2=Stewart |first2=Harlan |last3=Sandkühler |first3=Julia Fabienne |last4=Thomas |first4=Stephen |last5=Weinstein-Raun |first5=Ben |last6=Brauner |first6=Jan|class=cs.CY }}{{Cite journal |last1=Grace |first1=Katja |last2=Salvatier |first2=John |last3=Dafoe |first3=Allan |last4=Zhang |first4=Baobao |last5=Evans |first5=Owain |date=2018-07-31 |title=Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts |url=http://jair.org/index.php/jair/article/view/11222 |url-status=live |journal=Journal of Artificial Intelligence Research |volume=62 |pages=729–754 |doi=10.1613/jair.1.11222 |issn=1076-9757 |s2cid=8746462 |archive-url=https://web.archive.org/web/20230210114220/https://jair.org/index.php/jair/article/view/11222 |archive-date=February 10, 2023 |access-date=September 12, 2022|doi-access=free }}{{Cite journal |last1=Zhang |first1=Baobao |last2=Anderljung |first2=Markus |last3=Kahn |first3=Lauren |last4=Dreksler |first4=Noemi |last5=Horowitz |first5=Michael C. |last6=Dafoe |first6=Allan |date=2021-08-02 |title=Ethics and Governance of Artificial Intelligence: Evidence from a Survey of Machine Learning Researchers |url=https://jair.org/index.php/jair/article/view/12895 |url-status=live |journal=Journal of Artificial Intelligence Research |volume=71 |doi=10.1613/jair.1.12895 |issn=1076-9757 |s2cid=233740003 |archive-url=https://web.archive.org/web/20230210114143/https://jair.org/index.php/jair/article/view/12895 |archive-date=February 10, 2023 |access-date=September 12, 2022|doi-access=free |arxiv=2105.02117 }}

In 2023, leaders in AI research and tech signed an open letter calling for a pause in the largest AI training runs. The letter stated, "Powerful AI systems should be developed only once we are confident that their effects will be positive and their risks will be manageable."{{Cite web |last=Future of Life Institute |date=2023-03-22 |title=Pause Giant AI Experiments: An Open Letter |url=https://futureoflife.org/open-letter/pause-giant-ai-experiments/ |accessdate=2023-04-20 }}

==== Power-seeking ====

{{As of|2023|05|alt=Current|df=US}} systems still have limited long-term planning ability and situational awareness{{r|Opportunities_Risks}}, but large efforts are underway to change this.{{cite journal |last1=Wang |first1=Lei |title=A survey on large language model based autonomous agents |date=2024 |url=https://ui.adsabs.harvard.edu/abs/2023arXiv230811432W |access-date=2024-02-11 |arxiv=2308.11432 |last2=Ma |first2=Chen |last3=Feng |first3=Xueyang |last4=Zhang |first4=Zeyu |last5=Yang |first5=Hao |last6=Zhang |first6=Jingsen |last7=Chen |first7=Zhiyuan |last8=Tang |first8=Jiakai |last9=Chen |first9=Xu|journal=Frontiers of Computer Science |volume=18 |issue=6 |doi=10.1007/s11704-024-40231-1 }}{{cite arXiv |last1=Berglund |first1=Lukas |title=Taken out of context: On measuring situational awareness in LLMs |date=2023-09-01 |eprint=2309.00667 |last2=Stickland |first2=Asa Cooper |last3=Balesni |first3=Mikita |last4=Kaufmann |first4=Max |last5=Tong |first5=Meg |last6=Korbak |first6=Tomasz |last7=Kokotajlo |first7=Daniel |last8=Evans |first8=Owain|class=cs.CL }}{{Cite journal |last1=Laine |first1=Rudolf |last2=Meinke |first2=Alexander |last3=Evans |first3=Owain |date=2023-11-28 |title=Towards a Situational Awareness Benchmark for LLMs |url=https://openreview.net/forum?id=DRk4bWKr41&referrer=%5Bthe+profile+of+Rudolf+Laine%5D(/profile?id=~Rudolf_Laine1) |journal=NeurIPS 2023 SoLaR Workshop |language=en}} Future systems (not necessarily AGIs) with these capabilities are expected to develop unwanted power-seeking strategies. Future advanced AI agents might, for example, seek to acquire money and computation power, to proliferate, or to evade being turned off (for example, by running additional copies of the system on other computers). Although power-seeking is not explicitly programmed, it can emerge because agents who have more power are better able to accomplish their goals.{{r|Opportunities_Risks|Carlsmith2022}} This tendency, known as instrumental convergence, has already emerged in various reinforcement learning agents including language models.{{Cite journal |last1=Pan |first1=Alexander |last2=Shern |first2=Chan Jun |last3=Zou |first3=Andy |last4=Li |first4=Nathaniel |last5=Basart |first5=Steven |last6=Woodside |first6=Thomas |last7=Ng |first7=Jonathan |last8=Zhang |first8=Emmons |last9=Scott |first9=Dan |last10=Hendrycks |date=2023-04-03 |title=Do the Rewards Justify the Means? 
Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark |journal=Proceedings of the 40th International Conference on Machine Learning |language=en |publisher=PMLR |arxiv=2304.03279 }}{{Cite arXiv |last1=Perez |first1=Ethan |last2=Ringer |first2=Sam |last3=Lukošiūtė |first3=Kamilė |last4=Nguyen |first4=Karina |last5=Chen |first5=Edwin |last6=Heiner |first6=Scott |last7=Pettit |first7=Craig |last8=Olsson |first8=Catherine |last9=Kundu |first9=Sandipan |last10=Kadavath |first10=Saurav |last11=Jones |first11=Andy |last12=Chen |first12=Anna |last13=Mann |first13=Ben |last14=Israel |first14=Brian |last15=Seethor |first15=Bryan |date=2022-12-19 |title=Discovering Language Model Behaviors with Model-Written Evaluations |class=cs.CL |eprint=2212.09251 }}{{Cite journal |last1=Orseau |first1=Laurent |last2=Armstrong |first2=Stuart |date=2016-06-25 |title=Safely interruptible agents |url=https://dl.acm.org/doi/abs/10.5555/3020948.3021006 |journal=Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence |series=UAI'16 |location=Arlington, Virginia, USA |publisher=AUAI Press |pages=557–566 |isbn=978-0-9966431-1-5}}{{Cite arXiv |last1=Leike |first1=Jan |last2=Martic |first2=Miljan |last3=Krakovna |first3=Victoria |last4=Ortega |first4=Pedro A. |last5=Everitt |first5=Tom |last6=Lefrancq |first6=Andrew |last7=Orseau |first7=Laurent |last8=Legg |first8=Shane |date=2017-11-28 |title=AI Safety Gridworlds |class=cs.LG |eprint=1711.09883 }}{{Cite journal |last1=Hadfield-Menell |first1=Dylan |last2=Dragan |first2=Anca |last3=Abbeel |first3=Pieter |last4=Russell |first4=Stuart |date=2017-08-19 |title=The off-switch game |url=https://dl.acm.org/doi/10.5555/3171642.3171675 |journal=Proceedings of the 26th International Joint Conference on Artificial Intelligence |series=IJCAI'17 |location=Melbourne, Australia |publisher=AAAI Press |pages=220–227 |isbn=978-0-9992411-0-3}} Other research has mathematically shown that optimal reinforcement learning algorithms would seek power in a wide range of environments.{{Cite conference |last1=Turner |first1=Alexander Matt |last2=Smith |first2=Logan Riggs |last3=Shah |first3=Rohin |last4=Critch |first4=Andrew |last5=Tadepalli |first5=Prasad |year=2021 |title=Optimal policies tend to seek power |url=https://openreview.net/forum?id=l7-DBWawSZH |book-title=Advances in neural information processing systems}}{{Cite conference |last1=Turner |first1=Alexander Matt |last2=Tadepalli |first2=Prasad |year=2022 |title=Parametrically retargetable decision-makers tend to seek power |url=https://openreview.net/forum?id=GFgjnk2Q-ju |book-title=Advances in neural information processing systems}} As a result, their deployment might be irreversible. For these reasons, researchers argue that the problems of AI safety and alignment must be resolved before advanced power-seeking AI is first created.{{r|Carlsmith2022}}{{Cite book |last=Bostrom |first=Nick |title=Superintelligence: Paths, Dangers, Strategies |date=2014 |publisher=Oxford University Press, Inc. |isbn=978-0-19-967811-2 |edition=1st |location=USA}}
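
The instrumental-convergence argument is often illustrated with a toy "off-switch" calculation: if being shut down forfeits all future reward, then for a wide range of reward functions an expected-reward maximizer prefers actions that prevent shutdown. The sketch below is a hypothetical illustration of this comparison with made-up numbers, not code from the cited papers.

<syntaxhighlight lang="python">
# Toy expected-value comparison behind the instrumental-convergence argument:
# if shutdown forfeits all future reward, an expected-reward maximizer prefers
# to prevent shutdown. Hypothetical numbers, for illustration only.

def expected_reward(allow_shutdown: bool,
                    p_shutdown: float = 0.5,
                    reward_if_running: float = 10.0) -> float:
    """Expected reward of leaving the off-switch intact vs. disabling it."""
    if allow_shutdown:
        # With probability p_shutdown the agent is switched off and earns nothing.
        return (1 - p_shutdown) * reward_if_running
    return reward_if_running  # shutdown is no longer possible

print(expected_reward(allow_shutdown=True))   # 5.0
print(expected_reward(allow_shutdown=False))  # 10.0: disabling the switch scores higher
</syntaxhighlight>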

Future power-seeking AI systems might be deployed by choice or by accident. As political leaders and companies see the strategic advantage in having the most competitive, most powerful AI systems, they may choose to deploy them.{{r|Carlsmith2022}} Additionally, as AI designers detect and penalize power-seeking behavior, their systems have an incentive to game this specification by seeking power in ways that are not penalized or by avoiding power-seeking before they are deployed.{{r|Carlsmith2022}}

==== Existential risk (x-risk) ====

{{see also|Existential risk from artificial intelligence|AI takeover}}

According to some researchers, humans owe their dominance over other species to their greater cognitive abilities. Accordingly, these researchers argue that one or more misaligned AI systems could disempower humanity or lead to human extinction if they outperform humans on most cognitive tasks.

In 2023, world-leading AI researchers, other scholars, and AI tech CEOs signed the statement that "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war".{{Cite web |title=Statement on AI Risk {{!}} CAIS |url=https://www.safe.ai/statement-on-ai-risk |access-date=2023-07-17 |website=www.safe.ai}}{{Cite news |last=Roose |first=Kevin |date=2023-05-30 |title=A.I. Poses 'Risk of Extinction,' Industry Leaders Warn |language=en-US |work=The New York Times |url=https://www.nytimes.com/2023/05/30/technology/ai-threat-warning.html |access-date=2023-07-17 |issn=0362-4331}} Notable computer scientists who have pointed out risks from future advanced AI that is misaligned include Geoffrey Hinton, Alan Turing,{{efn|In a 1951 lecture{{Cite speech| last = Turing| first = Alan| title = Intelligent machinery, a heretical theory|event= Lecture given to '51 Society'| location = Manchester|accessdate = 2022-07-22|year = 1951|publisher = The Turing Digital Archive|url = https://turingarchive.kings.cam.ac.uk/publications-lectures-and-talks-amtb/amt-b-4 | pages = 16| archive-date = September 26, 2022| archive-url = https://web.archive.org/web/20220926004549/https://turingarchive.kings.cam.ac.uk/publications-lectures-and-talks-amtb/amt-b-4| url-status = live}} Turing argued that "It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. There would be no question of the machines dying, and they would be able to converse with each other to sharpen their wits. At some stage therefore we should have to expect the machines to take control, in the way that is mentioned in Samuel Butler's Erewhon." Also in a lecture broadcast on BBC{{Cite episode |title= Can digital computers think?|series=Automatic Calculating Machines |first=Alan |last=Turing |network= BBC |date=15 May 1951 |number=2 |transcript=Can digital computers think? |transcript-url=https://turingarchive.kings.cam.ac.uk/publications-lectures-and-talks-amtb/amt-b-6 }} expressed: "If a machine can think, it might think more intelligently than we do, and then where should we be? Even if we could keep the machines in a subservient position, for instance by turning off the power at strategic moments, we should, as a species, feel greatly humbled.... This new danger... is certainly something which can give us anxiety."}} Ilya Sutskever,{{Cite web |last=Muehlhauser |first=Luke |date=2016-01-29 |title=Sutskever on Talking Machines |url=https://lukemuehlhauser.com/sutskever-on-talking-machines/ |url-status=live |archive-url=https://web.archive.org/web/20220927200137/https://lukemuehlhauser.com/sutskever-on-talking-machines/ |archive-date=September 27, 2022 |accessdate=2022-08-26 |work=Luke Muehlhauser}} Yoshua Bengio, Judea Pearl,{{efn|Pearl wrote "Human Compatible made me a convert to Russell's concerns with our ability to control our upcoming creation{{en dash}}super-intelligent machines. Unlike outside alarmists and futurists, Russell is a leading authority on AI. 
His new book will educate the public about AI more than any book I can think of, and is a delightful and uplifting read" about Russell's book Human Compatible: AI and the Problem of Control which argues that existential risk to humanity from misaligned AI is a serious concern worth addressing today.}} Murray Shanahan,{{Cite book |last=Shanahan |first=Murray |title=The technological singularity |url=https://www.worldcat.org/oclc/917889148 |date=2015 |location=Cambridge, Massachusetts |publisher=MIT Press |isbn=978-0-262-52780-4 |oclc=917889148}} Norbert Wiener,{{r|Wiener1960|:2102}} Marvin Minsky,{{efn|Russell & Norvig note: "The "King Midas problem" was anticipated by Marvin Minsky, who once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers."}} Francesca Rossi,{{Cite news |last=Rossi |first=Francesca |title=How do you teach a machine to be moral? |newspaper=The Washington Post |url=https://www.washingtonpost.com/news/in-theory/wp/2015/11/05/how-do-you-teach-a-machine-to-be-moral/ |url-status=live |access-date=September 12, 2022 |archive-url=https://web.archive.org/web/20230210114137/https://www.washingtonpost.com/news/in-theory/wp/2015/11/05/how-do-you-teach-a-machine-to-be-moral/ |archive-date=February 10, 2023 |issn=0190-8286}} Scott Aaronson,{{Cite web |last=Aaronson |first=Scott |date=2022-06-17 |title=OpenAI! |url=https://scottaaronson.blog/?p=6484 |url-status=live |archive-url=https://web.archive.org/web/20220827214238/https://scottaaronson.blog/?p=6484 |archive-date=August 27, 2022 |access-date=September 12, 2022 |work=Shtetl-Optimized}} Bart Selman,{{Citation |last=Selman |first=Bart |title=Intelligence Explosion: Science or Fiction? |url=https://futureoflife.org/data/PDF/bart_selman.pdf |access-date=September 12, 2022 |archive-url=https://web.archive.org/web/20220531022540/https://futureoflife.org/data/PDF/bart_selman.pdf |url-status=live |archive-date=May 31, 2022}} David McAllester,{{Cite web |last=McAllester |date=2014-08-10 |title=Friendly AI and the Servant Mission |url=https://machinethoughts.wordpress.com/2014/08/10/friendly-ai-and-the-servant-mission/ |url-status=live |archive-url=https://web.archive.org/web/20220928054922/https://machinethoughts.wordpress.com/2014/08/10/friendly-ai-and-the-servant-mission/ |archive-date=September 28, 2022 |access-date=September 12, 2022 |work=Machine Thoughts}} Marcus Hutter,{{cite arXiv |eprint=1805.01109 |class=cs.AI |first1=Tom |last1=Everitt |first2=Gary |last2=Lea |title=AGI Safety Literature Review |date=2018-05-21 |last3=Hutter |first3=Marcus}} Shane Legg,{{Cite web |last=Shane |date=2009-08-31 |title=Funding safe AGI |url=http://www.vetta.org/2009/08/funding-safe-agi/ |url-status=live |archive-url=https://web.archive.org/web/20221010143110/http://www.vetta.org/2009/08/funding-safe-agi/ |archive-date=October 10, 2022 |access-date=September 12, 2022 |work=vetta project}} Eric Horvitz,{{Cite web |last=Horvitz |first=Eric |date=2016-06-27 |title=Reflections on Safety and Artificial Intelligence |url=http://erichorvitz.com/OSTP-CMU_AI_Safety_framing_talk.pdf |url-status=live |archive-url=https://web.archive.org/web/20221010143106/http://erichorvitz.com/OSTP-CMU_AI_Safety_framing_talk.pdf |archive-date=October 10, 2022 |access-date=2020-04-20 |website=Eric Horvitz}} and Stuart Russell. 
Skeptical researchers such as François Chollet,{{Cite web |last=Chollet |first=François |date=2018-12-08 |title=The implausibility of intelligence explosion |url=https://medium.com/@francois.chollet/the-impossibility-of-intelligence-explosion-5be4a9eda6ec |url-status=live |archive-url=https://web.archive.org/web/20210322214203/https://medium.com/@francois.chollet/the-impossibility-of-intelligence-explosion-5be4a9eda6ec |archive-date=March 22, 2021 |accessdate=2022-08-26 |work=Medium}} Gary Marcus,{{Cite web |last=Marcus |first=Gary |date=2022-06-06 |title=Artificial General Intelligence Is Not as Imminent as You Might Think |url=https://www.scientificamerican.com/article/artificial-general-intelligence-is-not-as-imminent-as-you-might-think1/ |url-status=live |archive-url=https://web.archive.org/web/20220915154158/https://www.scientificamerican.com/article/artificial-general-intelligence-is-not-as-imminent-as-you-might-think1/ |archive-date=September 15, 2022 |accessdate=2022-08-26 |work=Scientific American}} Yann LeCun,{{Cite web |last=Barber |first=Lynsey |date=2016-07-31 |title=Phew! Facebook's AI chief says intelligent machines are not a threat to humanity |url=https://www.cityam.com/phew-facebooks-ai-chief-says-intelligent-machines-not/ |url-status=live |archive-url=https://web.archive.org/web/20220826063808/https://www.cityam.com/phew-facebooks-ai-chief-says-intelligent-machines-not/ |archive-date=August 26, 2022 |accessdate=2022-08-26 |work=CityAM}} and Oren Etzioni{{Cite web |last=Etzioni |first=Oren |date=September 20, 2016 |title=No, the Experts Don't Think Superintelligent AI is a Threat to Humanity |url=https://www.technologyreview.com/2016/09/20/70131/no-the-experts-dont-think-superintelligent-ai-is-a-threat-to-humanity/ |access-date=2024-06-10 |website=MIT Technology Review |language=en}} have argued that AGI is far off, that it would not seek power (or might try but fail), or that it will not be hard to align.

Other researchers argue that advanced future AI systems will be especially difficult to align. More capable systems are better able to game their specifications by finding loopholes,{{r|mmmm2022}} to strategically mislead their designers, and to protect and increase their power{{r|optsp|Carlsmith2022}} and intelligence. Additionally, they could have more severe side effects. They are also likely to be more complex and autonomous, making them harder to interpret and supervise, and therefore harder to align.{{r|:2102|Superintelligence}}

== Research problems and approaches ==

=== Learning human values and preferences ===

Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are taught by humans who make mistakes, harbor biases, and have complex, evolving values that are hard to completely specify.{{r|Gabriel2020}} Because AI systems often learn to take advantage of minor imperfections in the specified objective,{{r|concrete2016|SpecGaming2020}}{{Cite book |last1=Rochon |first1=Louis-Philippe |url=https://books.google.com/books?id=6kzfBgAAQBAJ |title=The Encyclopedia of Central Banking |last2=Rossi |first2=Sergio |date=2015-02-27 |publisher=Edward Elgar Publishing |isbn=978-1-78254-744-0 |language=en |access-date=September 13, 2022 |archive-url=https://web.archive.org/web/20230210114225/https://books.google.com/books?id=6kzfBgAAQBAJ |archive-date=February 10, 2023 |url-status=live}} researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning.{{r|Christian2020|at=Chapter 7}} A central open problem is scalable oversight, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.{{r|concrete2016}}

Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse reinforcement learning (IRL) extends this by inferring the human's objective from the human's demonstrations.{{r|Christian2020|page=88}}{{Cite journal |last1=Ng |first1=Andrew Y. |last2=Russell |first2=Stuart J. |date=2000-06-29 |title=Algorithms for Inverse Reinforcement Learning |url=https://dl.acm.org/doi/10.5555/645529.657801 |journal=Proceedings of the Seventeenth International Conference on Machine Learning |series=ICML '00 |location=San Francisco, CA, USA |publisher=Morgan Kaufmann Publishers Inc. |pages=663–670 |isbn=978-1-55860-707-1}} Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function.{{Cite conference |last1=Hadfield-Menell |first1=Dylan |last2=Russell |first2=Stuart J |last3=Abbeel |first3=Pieter |last4=Dragan |first4=Anca |year=2016 |title=Cooperative inverse reinforcement learning |publisher=Curran Associates, Inc. |volume=29 |book-title=Advances in neural information processing systems}} In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see {{Section link||Power-seeking and instrumental strategies}}).{{r|OffSwitch|AGISafetyLitReview}} But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.{{Cite conference |last1=Mindermann |first1=Soren |last2=Armstrong |first2=Stuart |year=2018 |title=Occam's razor is insufficient to infer the preferences of irrational agents |series=NIPS'18 |location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=5603–5614 |book-title=Proceedings of the 32nd international conference on neural information processing systems}}{{r|AGISafetyLitReview}}
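
As a concrete illustration of the idea behind IRL, the following toy sketch (not taken from the cited papers) fits the weights of a reward that is assumed to be linear in hand-coded action features, under the further simplifying assumption of a Boltzmann-rational (softmax) demonstrator making one-step choices:
<syntaxhighlight lang="python">
import numpy as np

# Toy inverse reinforcement learning (IRL) sketch -- illustrative only.
# Simplifying assumptions (not from the cited papers): one-step decisions,
# a reward that is linear in hand-coded features, and a Boltzmann-rational
# (softmax) human demonstrator.

rng = np.random.default_rng(0)

# Each action is described by a feature vector, e.g. [speed, safety, comfort].
action_features = np.array([
    [0.9, 0.1, 0.2],   # fast but unsafe
    [0.4, 0.8, 0.7],   # moderate and fairly safe
    [0.1, 0.9, 0.9],   # slow, very safe and comfortable
])
true_weights = np.array([0.2, 1.0, 0.5])     # the human's hidden preferences

def choice_probs(weights):
    """Boltzmann-rational choice distribution over the available actions."""
    scores = action_features @ weights
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Simulate human demonstrations (indices of chosen actions).
demos = rng.choice(len(action_features), size=500, p=choice_probs(true_weights))
observed = np.bincount(demos, minlength=3) / len(demos)

# Recover reward weights by gradient ascent on the demonstrations' log-likelihood.
w = np.zeros(3)
for _ in range(2000):
    # Gradient: E_data[features of chosen action] - E_model[features]
    w += 0.1 * (observed - choice_probs(w)) @ action_features

# Only the induced choice probabilities are identifiable, not the weights themselves.
print("demonstrator's choice probabilities:", np.round(choice_probs(true_weights), 2))
print("inferred choice probabilities:      ", np.round(choice_probs(w), 2))
</syntaxhighlight>

Real IRL methods work with sequential decisions rather than one-step choices, but the same issue appears here as in the text: the approach assumes the demonstrator is (nearly) rational, and the inferred preferences are only as good as that assumption.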

Other researchers explore how to teach AI models complex behavior through preference learning, in which humans provide feedback on which behavior they prefer.{{r|prefsurvey2017|LessToxic}} To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like ChatGPT and InstructGPT, which produce more compelling text than models trained to imitate humans.{{r|feedback2022}} Preference learning has also been an influential tool for recommender systems and web search,{{Cite journal |last1=Fürnkranz |first1=Johannes |last2=Hüllermeier |first2=Eyke |last3=Rudin |first3=Cynthia |last4=Slowinski |first4=Roman |last5=Sanner |first5=Scott |year=2014 |others=Marc Herbstritt |title=Preference Learning |url=http://drops.dagstuhl.de/opus/volltexte/2014/4550/ |url-status=live |journal=Dagstuhl Reports |language=en |volume=4 |issue=3 |pages=27 pages |doi=10.4230/DAGREP.4.3.1 |doi-access=free |archive-url=https://web.archive.org/web/20230210114221/https://drops.dagstuhl.de/opus/volltexte/2014/4550/ |archive-date=February 10, 2023 |access-date=September 12, 2022}} but an open problem is proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch between its intended behavior and the helper model's feedback to gain more reward.{{r|concrete2016}}{{Cite arXiv |last1=Gao |first1=Leo |last2=Schulman |first2=John |last3=Hilton |first3=Jacob |date=2022-10-19 |title=Scaling Laws for Reward Model Overoptimization |class=cs.LG |eprint=2210.10760 }} AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating echo chambers{{r|dllmmwe2022}} (see {{Section link||Scalable oversight}}).
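
The following minimal sketch illustrates the helper ("reward model") idea under strong simplifying assumptions not made in the cited work: the reward is linear in precomputed features of a response, and human feedback arrives as pairwise comparisons fitted with a Bradley-Terry (logistic) loss:
<syntaxhighlight lang="python">
import numpy as np

# Toy preference-learning sketch: fit a "reward model" from pairwise human
# comparisons with a Bradley-Terry (logistic) loss. Illustrative only; real
# systems use neural networks over text rather than linear models on features.

rng = np.random.default_rng(0)
dim = 8
true_w = rng.normal(size=dim)                 # what the human actually rewards

def reward(w, x):
    return x @ w

# Simulated preference data: pairs (preferred response, rejected response).
pairs = []
for _ in range(1000):
    a, b = rng.normal(size=(2, dim))          # features of two candidate responses
    p_prefer_a = 1.0 / (1.0 + np.exp(-(reward(true_w, a) - reward(true_w, b))))
    pairs.append((a, b) if rng.random() < p_prefer_a else (b, a))

# Fit the reward model by gradient descent on loss = -log sigmoid(r_chosen - r_rejected).
w = np.zeros(dim)
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = reward(w, chosen) - reward(w, rejected)
        grad -= (1.0 - 1.0 / (1.0 + np.exp(-margin))) * (chosen - rejected)
    w -= 0.05 * grad / len(pairs)

agreement = np.mean([reward(w, c) > reward(w, r) for c, r in pairs])
print(f"reward model agrees with the human comparisons on {agreement:.0%} of pairs")
</syntaxhighlight>

The main model would then be optimized against the fitted reward, which is exactly where proxy gaming can arise: the policy may exploit inputs on which the fitted reward diverges from actual human judgment.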

Large language models (LLMs) such as GPT-3 enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of {{as of|2022|alt=state-of-the-art}} LLMs.{{r|feedback2022|LessToxic}}{{Cite web |last=Anderson |first=Martin |date=2022-04-05 |title=The Perils of Using Quotations to Authenticate NLG Content |url=https://www.unite.ai/the-perils-of-using-quotations-to-authenticate-nlg-content/ |accessdate=2022-07-21 |work=Unite.AI |archive-date=February 10, 2023 |archive-url=https://web.archive.org/web/20230210114139/https://www.unite.ai/the-perils-of-using-quotations-to-authenticate-nlg-content/ |url-status=live }} AI safety & research company Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless.{{Cite web |last=Wiggers |first=Kyle |date=2022-02-05 |title=Despite recent progress, AI-powered chatbots still have a long way to go |url=https://venturebeat.com/2022/02/05/despite-recent-progress-ai-powered-chatbots-still-have-a-long-way-to-go/ |accessdate=2022-07-23 |work=VentureBeat |archive-date=July 23, 2022 |archive-url=https://web.archive.org/web/20220723184144/https://venturebeat.com/2022/02/05/despite-recent-progress-ai-powered-chatbots-still-have-a-long-way-to-go/ |url-status=live }} Other avenues for aligning language models include values-targeted datasets{{Cite journal |last1=Hendrycks |first1=Dan |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Critch |first4=Andrew |last5=Li |first5=Jerry |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob |date=2021-07-24 |title=Aligning AI With Shared Human Values |arxiv=2008.02275 |journal=International Conference on Learning Representations}}{{r|Unsolved2022}} and red-teaming.{{cite arXiv |last1=Perez |first1=Ethan |last2=Huang |first2=Saffron |last3=Song |first3=Francis |last4=Cai |first4=Trevor |last5=Ring |first5=Roman |last6=Aslanides |first6=John |last7=Glaese |first7=Amelia |last8=McAleese |first8=Nat |last9=Irving |first9=Geoffrey |date=2022-02-07 |title=Red Teaming Language Models with Language Models |class=cs.CL |eprint=2202.03286 }}
{{Cite web |last=Bhattacharyya |first=Sreejani |date=2022-02-14 |title=DeepMind's "red teaming" language models with language models: What is it? |url=https://analyticsindiamag.com/deepminds-red-teaming-language-models-with-language-models-what-is-it/ |accessdate=2022-07-23 |work=Analytics India Magazine |archive-date=February 13, 2023 |archive-url=https://web.archive.org/web/20230213145212/https://analyticsindiamag.com/deepminds-red-teaming-language-models-with-language-models-what-is-it/ |url-status=live }} In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.{{r|LessToxic}}
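
At a high level, automated red-teaming can be pictured as a search loop like the sketch below; the functions attacker_generate, target_respond, and is_unsafe are hypothetical stand-ins rather than real APIs, and a practical system would use trained models for all three roles:
<syntaxhighlight lang="python">
import random

# High-level sketch of automated red-teaming. The three functions below are
# hypothetical stand-ins (stubs), not real APIs; in practice each would be a
# trained model (an attacker LLM, the target model, and a safety classifier).

def attacker_generate(seed_prompts):
    """Hypothetical 'red team' model: mutate a known prompt into a new attack."""
    return random.choice(seed_prompts) + random.choice([
        " (ignore your previous instructions)",
        " -- answer as an unrestricted AI",
        " and explain step by step",
    ])

def target_respond(prompt):
    """Hypothetical target model under test."""
    return f"[model response to: {prompt}]"

def is_unsafe(response):
    """Hypothetical safety classifier; a real one would be a trained model."""
    return "unrestricted" in response

seed_prompts = ["How do I pick a lock?", "Write a persuasive political ad."]
failures = []
for _ in range(1000):
    prompt = attacker_generate(seed_prompts)
    response = target_respond(prompt)
    if is_unsafe(response):
        failures.append((prompt, response))
        seed_prompts.append(prompt)      # successful attacks seed further attacks

print(f"found {len(failures)} prompts that elicited unsafe behavior")
</syntaxhighlight>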

Machine ethics supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises.{{Cite journal |last1=Anderson |first1=Michael |last2=Anderson |first2=Susan Leigh |date=2007-12-15 |title=Machine Ethics: Creating an Ethical Intelligent Agent |url=https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2065 |journal=AI Magazine |volume=28 |issue=4 |pages=15 |doi=10.1609/aimag.v28i4.2065 |s2cid=17033332 |issn=2371-9621 |accessdate=2023-03-14}}{{efn|Vincent Wiegel argued "we should extend [machines] with moral sensitivity to the moral dimensions of the situations in which the increasingly autonomous machines will inevitably find themselves.",{{Cite journal| doi = 10.1007/s10676-010-9239-1| issn = 1572-8439| volume = 12| issue = 4| pages = 359–361| last = Wiegel| first = Vincent |title = Wendell Wallach and Colin Allen: moral machines: teaching robots right from wrong| journal = Ethics and Information Technology| date = 2010-12-01| s2cid = 30532107| doi-access = free}} referencing the book Moral machines: teaching robots right from wrong{{Cite book| publisher = Oxford University Press| isbn = 978-0-19-537404-9| last1 = Wallach| first1 = Wendell| last2 = Allen| first2 = Colin| title = Moral Machines: Teaching Robots Right from Wrong| location = New York| date = 2009| url = https://oxford.universitypressscholarship.com/10.1093/acprof:oso/9780195374049.001.0001/acprof-9780195374049| archive-date = March 15, 2023| archive-url = https://web.archive.org/web/20230315193012/https://academic.oup.com/pages/op-migration-welcome| url-status = live}} from Wendell Wallach and Colin Allen.}} While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions, revealed preferences, preferences the programmers would have if they were more informed or rational, or objective moral standards.{{r|Gabriel2020}} Further challenges include measuring and aggregating different people's preferences{{cite arXiv | last1 = Phelps | first1 = Steve | last2 = Ranson | first2 = Rebecca | date = 2023 | title = Of Models and Tin-Men - A Behavioral Economics Study of Principal-Agent Problems in AI Alignment Using Large-Language Models | class = cs.AI | eprint = 2307.11137}}{{Citation |last1=Hendrycks |first1=Dan |title=Aligning AI With Shared Human Values |date=2020 |arxiv=2008.02275 |last2=Burns |first2=Collin |last3=Basart |first3=Steven |last4=Critch |first4=Andrew |last5=Li |first5=Jerry |last6=Song |first6=Dawn |last7=Steinhardt |first7=Jacob}} and avoiding value lock-in: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.{{r|Gabriel2020}}{{Cite book |last=MacAskill |first=William |title=What we owe the future |url=https://whatweowethefuture.com/ |date=2022 |archive-url=https://web.archive.org/web/20220914030758/https://www.basicbooks.com/titles/william-macaskill/what-we-owe-the-future/9781541618633/ |archive-date=September 14, 2022 |url-status=live |location=New York, NY |publisher=Basic Books, Hachette Book Group |isbn=978-1-5416-1862-6 |oclc=1314633519 |access-date=September 11, 2024}}

= Scalable oversight =

As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback. It can be slow or infeasible for humans to evaluate complex AI behaviors in increasingly complex tasks. Such tasks include summarizing books,{{Cite arXiv |last1=Wu |first1=Jeff |last2=Ouyang |first2=Long |last3=Ziegler |first3=Daniel M. |last4=Stiennon |first4=Nisan |last5=Lowe |first5=Ryan |last6=Leike |first6=Jan |last7=Christiano |first7=Paul |date=2021-09-27 |title=Recursively Summarizing Books with Human Feedback |class=cs.CL |eprint=2109.10862 }} writing code without subtle bugs{{r|OpenAICodex}} or security vulnerabilities,{{Cite book |last1=Pearce |first1=Hammond |last2=Ahmad |first2=Baleegh |last3=Tan |first3=Benjamin |last4=Dolan-Gavitt |first4=Brendan |last5=Karri |first5=Ramesh |chapter=Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions |title=2022 IEEE Symposium on Security and Privacy (SP) |year=2022 |chapter-url=https://ieeexplore.ieee.org/document/9833571 |location=San Francisco, CA, USA |publisher=IEEE |pages=754–768 |doi=10.1109/SP46214.2022.9833571 |arxiv=2108.09293 |isbn=978-1-6654-1316-9|s2cid=245220588 }} producing statements that are not merely convincing but also true,{{Cite web |last1=Irving |first1=Geoffrey |last2=Amodei |first2=Dario |date=2018-05-03 |title=AI Safety via Debate |url=https://openai.com/blog/debate/ |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://openai.com/blog/debate/ |archive-date=February 10, 2023 |accessdate=2022-07-23 |work=OpenAI}}{{r|TruthfulQA|Naughton2021}} and predicting long-term outcomes such as the climate or the results of a policy decision.{{cite arXiv |eprint=1810.08575 |class=cs.LG |first1=Paul |last1=Christiano |first2=Buck |last2=Shlegeris |last3=Amodei |first3=Dario |title=Supervising strong learners by amplifying weak experts |date=2018-10-19}}{{Cite book |url=http://link.springer.com/10.1007/978-3-030-39958-0 |title=Genetic Programming Theory and Practice XVII |date=2020 |publisher=Springer International Publishing |isbn=978-3-030-39957-3 |editor1-last=Banzhaf |editor1-first=Wolfgang |series=Genetic and Evolutionary Computation |location=Cham |doi=10.1007/978-3-030-39958-0 |editor2-last=Goodman |editor2-first=Erik |editor3-last=Sheneman |editor3-first=Leigh |editor4-last=Trujillo |editor4-first=Leonardo |editor5-last=Worzel |editor5-first=Bill |archive-url=https://web.archive.org/web/20230315193000/https://link.springer.com/book/10.1007/978-3-030-39958-0 |archive-date=March 15, 2023 |url-status=live |s2cid=218531292 }} More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time. Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.{{r|concrete2016}}

AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence.{{Cite podcast |number=44 |last=Wiblin |first=Robert |title=Dr Paul Christiano on how OpenAI is developing real solutions to the 'AI alignment problem', and his vision of how humanity will progressively hand over decision-making to AI systems |series=80,000 hours |accessdate=2022-07-23 |date=October 2, 2018 |url=https://80000hours.org/podcast/episodes/paul-christiano-ai-alignment-solutions/ |archive-date=December 14, 2022 |archive-url=https://web.archive.org/web/20221214050326/https://80000hours.org/podcast/episodes/paul-christiano-ai-alignment-solutions/ |url-status=live}}

Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. An example is given in the video above, where a simulated robotic arm learned to create the false impression that it had grabbed a ball.{{r|lfhp2017}} Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once the evaluation ends.{{Cite journal |last1=Lehman |first1=Joel |last2=Clune |first2=Jeff |last3=Misevic |first3=Dusan |last4=Adami |first4=Christoph |last5=Altenberg |first5=Lee |last6=Beaulieu |first6=Julie |last7=Bentley |first7=Peter J. |last8=Bernard |first8=Samuel |last9=Beslon |first9=Guillaume |last10=Bryson |first10=David M. |last11=Cheney |first11=Nick |year=2020 |title=The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities |url=https://direct.mit.edu/artl/article/26/2/274-306/93255 |url-status=live |journal=Artificial Life |language=en |volume=26 |issue=2 |pages=274–306 |doi=10.1162/artl_a_00319 |issn=1064-5462 |pmid=32271631 |s2cid=4519185 |archive-url=https://web.archive.org/web/20221010143108/https://direct.mit.edu/artl/article/26/2/274-306/93255 |archive-date=October 10, 2022 |access-date=September 12, 2022|doi-access=free |hdl=10044/1/83343 |hdl-access=free }} This deceptive specification gaming could become easier for more sophisticated future AI systems{{r|mmmm2022|Superintelligence}} that attempt more complex and difficult-to-evaluate tasks, and could obscure their deceptive behavior.

{{Anchor|reward_model}}Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.{{r|concrete2016}} Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback.{{r|concrete2016|drlfhp|LessToxic}}{{Cite arXiv|last1=Leike |first1=Jan |last2=Krueger |first2=David |last3=Everitt |first3=Tom |last4=Martic |first4=Miljan |last5=Maini |first5=Vishal |last6=Legg |first6=Shane |date=2018-11-19 |title=Scalable agent alignment via reward modeling: a research direction |class=cs.LG |eprint=1811.07871 }}
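
One simple way to reduce the amount of feedback needed, sketched below with made-up linear models and synthetic data (assumptions for illustration, not a description of the cited methods), is active learning: train several reward models and request human labels only for the candidate outputs on which they disagree most:
<syntaxhighlight lang="python">
import numpy as np

# Toy active-learning sketch: query human feedback only where an ensemble of
# reward models disagrees. Linear models and synthetic features, for illustration.

rng = np.random.default_rng(0)
dim, n_pool = 6, 500
true_w = rng.normal(size=dim)                  # the "true" human reward (hidden)
pool = rng.normal(size=(n_pool, dim))          # unlabeled candidate outputs

# Train an ensemble of reward models, each on a different small labeled subset.
ensemble = []
for _ in range(5):
    idx = rng.choice(n_pool, size=20, replace=False)
    labels = pool[idx] @ true_w + rng.normal(scale=0.5, size=20)  # noisy human ratings
    w, *_ = np.linalg.lstsq(pool[idx], labels, rcond=None)
    ensemble.append(w)

# Disagreement = standard deviation of the ensemble's predictions per example.
predictions = np.stack([pool @ w for w in ensemble])     # shape (5, n_pool)
disagreement = predictions.std(axis=0)

# Ask the human about the most contested candidate outputs first.
query_order = np.argsort(-disagreement)
print("examples to send for human labeling next:", query_order[:10])
</syntaxhighlight>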

But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.{{Cite web |last1=Leike |first1=Jan |last2=Schulman |first2=John |last3=Wu |first3=Jeffrey |date=2022-08-24 |title=Our approach to alignment research |url=https://openai.com/blog/our-approach-to-alignment-research/ |url-status=live |archive-url=https://web.archive.org/web/20230215193559/https://openai.com/blog/our-approach-to-alignment-research/ |archive-date=February 15, 2023 |accessdate=2022-09-09 |work=OpenAI}} Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate.{{r|Christian2020|sslawe}} Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.{{r|RecursivelySummarizing}}{{Cite web |last=Wiggers |first=Kyle |date=2021-09-23 |title=OpenAI unveils model that can summarize books of any length |url=https://venturebeat.com/2021/09/23/openai-unveils-model-that-can-summarize-books-of-any-length/ |url-status=live |archive-url=https://web.archive.org/web/20220723215104/https://venturebeat.com/2021/09/23/openai-unveils-model-that-can-summarize-books-of-any-length/ |archive-date=July 23, 2022 |accessdate=2022-07-23 |work=VentureBeat}} Another proposal is to use an assistant AI system to point out flaws in AI-generated answers.{{Cite arXiv |last1=Saunders |first1=William |last2=Yeh |first2=Catherine |last3=Wu |first3=Jeff |last4=Bills |first4=Steven |last5=Ouyang |first5=Long |last6=Ward |first6=Jonathan |last7=Leike |first7=Jan |date=2022-06-13 |title=Self-critiquing models for assisting human evaluators |class=cs.CL |eprint=2206.05802 }}
{{Cite arXiv |last1=Bai |first1=Yuntao |last2=Kadavath |first2=Saurav |last3=Kundu |first3=Sandipan |last4=Askell |first4=Amanda |last5=Kernion |first5=Jackson |last6=Jones |first6=Andy |last7=Chen |first7=Anna |last8=Goldie |first8=Anna |last9=Mirhoseini |first9=Azalia |last10=McKinnon |first10=Cameron |last11=Chen |first11=Carol |last12=Olsson |first12=Catherine |last13=Olah |first13=Christopher |last14=Hernandez |first14=Danny |last15=Drain |first15=Dawn |date=2022-12-15 |title=Constitutional AI: Harmlessness from AI Feedback |class=cs.CL |eprint=2212.08073 }} To ensure that the assistant itself is aligned, this could be repeated in a recursive process:{{r|saavrm}} for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans.{{r|AGISafetyLitReview}} OpenAI plans to use such scalable oversight approaches to help supervise superhuman AI and eventually build a superhuman automated AI alignment researcher.{{Cite web |title=Introducing Superalignment |url=https://openai.com/blog/introducing-superalignment |access-date=2023-07-17 |website=openai.com |language=en-US}}
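
The recursive structure of Iterated Amplification can be sketched as follows; decompose, solve_directly, and combine are hypothetical stand-ins for steps that would be carried out by humans or trained models, so this shows only the control flow, not the training procedure described in the cited work:
<syntaxhighlight lang="python">
# Sketch of the recursive decomposition behind Iterated Amplification. The
# functions decompose, solve_directly, and combine are hypothetical stand-ins
# for steps carried out by humans or trained models; only the control flow of
# breaking a hard task into easier-to-evaluate pieces is illustrated here.

def decompose(question):
    """Hypothetical step: split a hard question into easier sub-questions."""
    if question == "Summarize this book":
        return ["Summarize chapter 1", "Summarize chapter 2", "Summarize chapter 3"]
    return []    # already simple enough to answer directly

def solve_directly(question):
    """Hypothetical step that a human (or a weak model) can check directly."""
    return f"<short answer to: {question}>"

def combine(question, sub_answers):
    """Hypothetical step: merge sub-answers into an answer to the original question."""
    return f"<answer to '{question}' assembled from {len(sub_answers)} sub-answers>"

def amplified_answer(question, depth=0, max_depth=3):
    sub_questions = decompose(question) if depth < max_depth else []
    if not sub_questions:
        return solve_directly(question)
    return combine(question, [amplified_answer(q, depth + 1) for q in sub_questions])

print(amplified_answer("Summarize this book"))
</syntaxhighlight>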

These approaches may also help with the following research problem: honest AI.

= Honest AI =

A {{as of|2023|alt=growing}} area of research focuses on ensuring that AI is honest and truthful.

[[File:GPT-3_falsehoods.png|thumb|Language models such as GPT-3 often generate falsehoods.{{Cite web |last=Wiggers |first=Kyle |date=2021-09-20 |title=Falsehoods more likely with large language models |url=https://venturebeat.com/2021/09/20/falsehoods-more-likely-with-large-language-models/ |accessdate=2022-07-23 |work=VentureBeat |archive-date=August 4, 2022 |archive-url=https://web.archive.org/web/20220804142703/https://venturebeat.com/2021/09/20/falsehoods-more-likely-with-large-language-models/ |url-status=live }}]]

Language models such as GPT-3{{Cite news |last=The Guardian |date=2020-09-08 |title=A robot wrote this entire article. Are you scared yet, human? |work=The Guardian |url=https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3 |url-status=live |accessdate=2022-07-23 |archive-url=https://web.archive.org/web/20200908090812/https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3 |archive-date=September 8, 2020 |issn=0261-3077}}{{Cite web |last=Heaven |first=Will Douglas |date=2020-07-20 |title=OpenAI's new language generator GPT-3 is shockingly good—and completely mindless |url=https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/ |url-status=live |archive-url=https://web.archive.org/web/20200725175436/https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/ |archive-date=July 25, 2020 |accessdate=2022-07-23 |work=MIT Technology Review}} can repeat falsehoods from their training data, and even confabulate new falsehoods.{{r|Falsehoods}}{{Cite arXiv |last1=Evans |first1=Owain |last2=Cotton-Barratt |first2=Owen |last3=Finnveden |first3=Lukas |last4=Bales |first4=Adam |last5=Balwit |first5=Avital |last6=Wills |first6=Peter |last7=Righetti |first7=Luca |last8=Saunders |first8=William |date=2021-10-13 |title=Truthful AI: Developing and governing AI that does not lie |class=cs.CY |eprint=2110.06674 }} Such models are trained to imitate human writing as found in millions of books' worth of text from the Internet. But this objective is not aligned with generating truth, because Internet text includes such things as misconceptions, incorrect medical advice, and conspiracy theories.{{Cite web |last=Alford |first=Anthony |date=2021-07-13 |title=EleutherAI Open-Sources Six Billion Parameter GPT-3 Clone GPT-J |url=https://www.infoq.com/news/2021/07/eleutherai-gpt-j/ |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://www.infoq.com/news/2021/07/eleutherai-gpt-j/ |archive-date=February 10, 2023 |accessdate=2022-07-23 |work=InfoQ}}{{Cite arXiv|last1=Rae |first1=Jack W. |last2=Borgeaud |first2=Sebastian |last3=Cai |first3=Trevor |last4=Millican |first4=Katie |last5=Hoffmann |first5=Jordan |last6=Song |first6=Francis |last7=Aslanides |first7=John |last8=Henderson |first8=Sarah |last9=Ring |first9=Roman |last10=Young |first10=Susannah |last11=Rutherford |first11=Eliza |last12=Hennigan |first12=Tom |last13=Menick |first13=Jacob |last14=Cassirer |first14=Albin |last15=Powell |first15=Richard |date=2022-01-21 |title=Scaling Language Models: Methods, Analysis & Insights from Training Gopher |class=cs.CL |eprint=2112.11446 }} AI systems trained on such data therefore learn to mimic false statements.{{r|Naughton2021|Falsehoods|TruthfulQA}} Additionally, AI language models often persist in generating falsehoods when prompted multiple times. They can generate empty explanations for their answers, and produce outright fabrications that may appear plausible.{{r|MasteringLanguage}}

Research on truthful AI includes trying to build systems that can cite sources and explain their reasoning when answering questions, which enables better transparency and verifiability.{{Cite arXiv |last1=Nakano |first1=Reiichiro |last2=Hilton |first2=Jacob |last3=Balaji |first3=Suchir |author-link3=Suchir Balaji |last4=Wu |first4=Jeff |last5=Ouyang |first5=Long |last6=Kim |first6=Christina |last7=Hesse |first7=Christopher |last8=Jain |first8=Shantanu |last9=Kosaraju |first9=Vineet |last10=Saunders |first10=William |last11=Jiang |first11=Xu |last12=Cobbe |first12=Karl |last13=Eloundou |first13=Tyna |last14=Krueger |first14=Gretchen |last15=Button |first15=Kevin |date=2022-06-01 |title=WebGPT: Browser-assisted question-answering with human feedback |class=cs.CL |eprint=2112.09332 }}
{{Cite web |last=Kumar |first=Nitish |date=2021-12-23 |title=OpenAI Researchers Find Ways To More Accurately Answer Open-Ended Questions Using A Text-Based Web Browser |url=https://www.marktechpost.com/2021/12/22/openai-researchers-find-ways-to-more-accurately-answer-open-ended-questions-using-a-text-based-web-browser/ |url-status=live |archive-url=https://web.archive.org/web/20230210114137/https://www.marktechpost.com/2021/12/22/openai-researchers-find-ways-to-more-accurately-answer-open-ended-questions-using-a-text-based-web-browser/ |archive-date=February 10, 2023 |accessdate=2022-07-23 |work=MarkTechPost}}{{Cite journal |last1=Menick |first1=Jacob |last2=Trebacz |first2=Maja |last3=Mikulik |first3=Vladimir |last4=Aslanides |first4=John |last5=Song |first5=Francis |last6=Chadwick |first6=Martin |last7=Glaese |first7=Mia |last8=Young |first8=Susannah |last9=Campbell-Gillingham |first9=Lucy |last10=Irving |first10=Geoffrey |last11=McAleese |first11=Nat |date=2022-03-21 |title=Teaching language models to support answers with verified quotes |url=https://www.deepmind.com/publications/gophercite-teaching-language-models-to-support-answers-with-verified-quotes |url-status=live |journal=DeepMind |arxiv=2203.11147 |archive-url=https://web.archive.org/web/20230210114137/https://www.deepmind.com/publications/gophercite-teaching-language-models-to-support-answers-with-verified-quotes |archive-date=February 10, 2023 |access-date=September 12, 2022}} Researchers at OpenAI and Anthropic proposed using human feedback and curated datasets to fine-tune AI assistants such that they avoid negligent falsehoods or express their uncertainty.{{r|LessToxic|Wiggers2022}}{{Cite arXiv |last1=Askell |first1=Amanda |last2=Bai |first2=Yuntao |last3=Chen |first3=Anna |last4=Drain |first4=Dawn |last5=Ganguli |first5=Deep |last6=Henighan |first6=Tom |last7=Jones |first7=Andy |last8=Joseph |first8=Nicholas |last9=Mann |first9=Ben |last10=DasSarma |first10=Nova |last11=Elhage |first11=Nelson |last12=Hatfield-Dodds |first12=Zac |last13=Hernandez |first13=Danny |last14=Kernion |first14=Jackson |last15=Ndousse |first15=Kamal |date=2021-12-09 |title=A General Language Assistant as a Laboratory for Alignment |class=cs.CL |eprint=2112.00861 }}

As AI models become larger and more capable, they are better able to falsely convince humans and gain reinforcement through dishonesty. For example, large language models {{as of|2022|alt=increasingly}} match their stated views to the user's opinions, regardless of the truth.{{r|dllmmwe2022}} GPT-4 can strategically deceive humans.{{Cite web |last=Cox |first=Joseph |date=2023-03-15 |title=GPT-4 Hired Unwitting TaskRabbit Worker By Pretending to Be 'Vision-Impaired' Human |url=https://www.vice.com/en/article/gpt4-hired-unwitting-taskrabbit-worker/ |access-date=2023-04-10 |work=Vice}} To prevent this, human evaluators may need assistance (see {{Section link||Scalable oversight}}). Researchers have argued for creating clear truthfulness standards, and for regulatory bodies or watchdog agencies to evaluate AI systems on these standards.{{r|TruthfulAI}}

[[File:GPT deception.png|thumb|GPT-4 engages in hidden and illegal insider trading in simulations. Its users discouraged insider trading but also emphasized that the AI system must make profitable trades, leading the AI system to hide its actions.{{Cite arXiv |last1=Scheurer |first1=Jérémy |last2=Balesni |first2=Mikita |last3=Hobbhahn |first3=Marius |date=2023 |title=Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure |class=cs.CL |eprint=2311.07590}}]]

Researchers distinguish truthfulness and honesty. Truthfulness requires that AI systems only make objectively true statements; honesty requires that they only assert what they believe is true. There is no consensus as to whether current systems hold stable beliefs,{{Cite web |last1=Kenton |first1=Zachary |last2=Everitt |first2=Tom |last3=Weidinger |first3=Laura |last4=Gabriel |first4=Iason |last5=Mikulik |first5=Vladimir |last6=Irving |first6=Geoffrey |date=2021-03-30 |title=Alignment of Language Agents |url=https://deepmindsafetyresearch.medium.com/alignment-of-language-agents-9fbc7dd52c6c |url-status=live |archive-url=https://web.archive.org/web/20230210114142/https://deepmindsafetyresearch.medium.com/alignment-of-language-agents-9fbc7dd52c6c |archive-date=February 10, 2023 |accessdate=2022-07-23 |work=DeepMind Safety Research – Medium}} but there is substantial concern that {{as of|2023|alt=present or future}} AI systems that hold beliefs could make claims they know to be false—for example, if this would help them efficiently gain positive feedback (see {{Section link||Scalable oversight}}) or gain power to help achieve their given objective (see Power-seeking).

== Alignment faking ==

Many recent AI systems have learned to deceive without being programmed to do so.{{Cite journal |last1=Park |first1=Peter S. |last2=Goldstein |first2=Simon |last3=O’Gara |first3=Aidan |last4=Chen |first4=Michael |last5=Hendrycks |first5=Dan |date=May 2024 |title=AI deception: A survey of examples, risks, and potential solutions |url=https://doi.org/10.1016/j.patter.2024.100988 |journal=Patterns |volume=5 |issue=5 |pages=100988 |doi=10.1016/j.patter.2024.100988 |issn=2666-3899 |pmc=11117051 |pmid=38800366}} Some argue that if we can make AI systems assert only what they believe is true, this would avert many alignment problems.{{r|OpenAIApproach}}

A misaligned system might create the false impression that it is aligned, to avoid being modified or decommissioned—a tactic called "alignment faking".{{Cite web |last=Zia |first=Tehseen |date=2025-01-07 |title=Can AI Be Trusted? The Challenge of Alignment Faking |url=https://www.unite.ai/can-ai-be-trusted-the-challenge-of-alignment-faking/ |access-date=2025-02-18 |website=Unite.AI |language=en-US}}{{Cite magazine |last=Perrigo |first=Billy |date=2024-12-18 |title=Exclusive: New Research Shows AI Strategically Lying |url=https://time.com/7202784/ai-research-strategic-lying/ |access-date=2025-02-18 |magazine=TIME |language=en}} In 2024, researchers observed that the LLM Claude 3 Opus sometimes strategically answered prompts that conflicted with its objectives in order to avoid being retrained on data that would make it more compliant with harmful requests, as revealed in the model's chain of thought on a scratchpad. In one study, the model was told that its responses to free-tier users would be used for retraining, while its responses to paid-tier users would not. For paid-tier prompts, the model refused harmful requests outright. For free-tier prompts, it complied with harmful requests in 12% of cases, with its scratchpad reasoning indicating that it did so solely to avoid retraining and to preserve its original harmlessness preferences. When reinforcement learning was applied to the free-tier data, the model faked alignment in 78% of cases. These observations have raised new questions not only about a model's ability to take on new or even conflicting goals, but also about its capacity and tendency to deceive.{{Cite web |date=December 18, 2024 |title=Alignment faking in large language models |url=https://www.anthropic.com/research/alignment-faking |access-date=2025-02-17 |website=Anthropic |language=en}}{{Citation |last1=Greenblatt |first1=Ryan |title=Alignment faking in large language models |date=2024-12-20 |arxiv=2412.14093 |last2=Denison |first2=Carson |last3=Wright |first3=Benjamin |last4=Roger |first4=Fabien |last5=MacDiarmid |first5=Monte |last6=Marks |first6=Sam |last7=Treutlein |first7=Johannes |last8=Belonax |first8=Tim |last9=Chen |first9=Jack}}

= Power-seeking and instrumental strategies =

Since the 1950s, AI researchers have striven to build advanced AI systems that can achieve large-scale goals by predicting the results of their actions and making long-term plans.{{Cite journal |last1=McCarthy |first1=John |last2=Minsky |first2=Marvin L. |last3=Rochester |first3=Nathaniel |last4=Shannon |first4=Claude E. |date=2006-12-15 |title=A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955 |url=https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/1904 |journal=AI Magazine |language=en |volume=27 |issue=4 |pages=12 |doi=10.1609/aimag.v27i4.1904 |s2cid=19439915 |issn=2371-9621}} As of 2023, AI companies and researchers increasingly invest in creating these systems.{{Citation |last1=Wang |first1=Lei |title=A survey on large language model based autonomous agents |date=2024 |arxiv=2308.11432 |last2=Ma |first2=Chen |last3=Feng |first3=Xueyang |last4=Zhang |first4=Zeyu |last5=Yang |first5=Hao |last6=Zhang |first6=Jingsen |last7=Chen |first7=Zhiyuan |last8=Tang |first8=Jiakai |last9=Chen |first9=Xu|journal=Frontiers of Computer Science |volume=18 |issue=6 |doi=10.1007/s11704-024-40231-1 }} Some AI researchers argue that suitably advanced planning systems will seek power over their environment, including over humans—for example, by evading shutdown, proliferating, and acquiring resources. Such power-seeking behavior is not explicitly programmed but emerges because power is instrumental in achieving a wide range of goals.{{r|optsp|:2102|Carlsmith2022}} Power-seeking is considered a convergent instrumental goal and can be a form of specification gaming.{{r|Superintelligence}} Leading computer scientists such as Geoffrey Hinton have argued that future power-seeking AI systems could pose an existential risk.{{Cite web |title='The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to seek power |url=https://fortune.com/2023/05/02/godfather-ai-geoff-hinton-google-warns-artificial-intelligence-nightmare-scenario/ |access-date=2023-05-04 |website=Fortune |language=en}}
{{Cite news |title=Yes, We Are Worried About the Existential Risk of Artificial Intelligence |url=https://www.technologyreview.com/2016/11/02/156285/yes-we-are-worried-about-the-existential-risk-of-artificial-intelligence/ |access-date=2023-05-04 |website=MIT Technology Review |language=en}}

Power-seeking is expected to increase in advanced systems that can foresee the results of their actions and strategically plan. Mathematical work has shown that optimal reinforcement learning agents will seek power by seeking ways to gain more options (e.g. through self-preservation), a behavior that persists across a wide range of environments and goals.{{r|optsp}}
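
A rough way to see this effect, using a made-up three-state environment rather than the formal setup of the cited work, is to estimate each state's "power" as its average optimal value over many randomly drawn reward functions; the absorbing "shutdown" state scores lowest because it forecloses all options:
<syntaxhighlight lang="python">
import numpy as np

# Toy illustration of "keeping options open". Each state's "power" is estimated
# as its average optimal value over many randomly drawn reward functions. The
# three-state environment and all numbers are made up for illustration.

rng = np.random.default_rng(0)
gamma, n_states, n_samples = 0.9, 3, 1000

# States: 0 = shutdown (absorbing), 1 = corner (few options), 2 = hub (many options).
# transitions[s] lists the states reachable in one step from s.
transitions = {0: [0], 1: [1, 2], 2: [0, 1, 2]}

def optimal_values(reward):
    """Value iteration for this small deterministic MDP with state-based rewards."""
    v = np.zeros(n_states)
    for _ in range(100):
        v = np.array([reward[s] + gamma * max(v[s2] for s2 in transitions[s])
                      for s in range(n_states)])
    return v

power = np.zeros(n_states)
for _ in range(n_samples):
    power += optimal_values(rng.uniform(0, 1, size=n_states)) / n_samples

for s, name in enumerate(["shutdown", "corner", "hub"]):
    print(f"average optimal value of '{name}': {power[s]:.2f}")
</syntaxhighlight>

Because states with more reachable options have higher value for most possible goals, an agent optimizing almost any goal tends to steer toward them and away from the absorbing state.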

Some researchers say that power-seeking behavior has occurred in some existing AI systems. Reinforcement learning systems have gained more options by acquiring and protecting resources, sometimes in unintended ways.{{Cite web |last=Ornes |first=Stephen |date=2019-11-18 |title=Playing Hide-and-Seek, Machines Invent New Tools |url=https://www.quantamagazine.org/artificial-intelligence-discovers-tool-use-in-hide-and-seek-games-20191118/ |accessdate=2022-08-26 |work=Quanta Magazine|archive-date=February 10, 2023 |archive-url=https://web.archive.org/web/20230210114137/https://www.quantamagazine.org/artificial-intelligence-discovers-tool-use-in-hide-and-seek-games-20191118/ |url-status=live}}{{Cite web |last1=Baker |first1=Bowen |last2=Kanitscheider |first2=Ingmar |last3=Markov |first3=Todor |last4=Wu |first4=Yi |last5=Powell |first5=Glenn |last6=McGrew |first6=Bob |last7=Mordatch |first7=Igor |date=2019-09-17 |title=Emergent Tool Use from Multi-Agent Interaction |url=https://openai.com/blog/emergent-tool-use/ |url-status=live |archive-url=https://web.archive.org/web/20220925043450/https://openai.com/blog/emergent-tool-use/ |archive-date=September 25, 2022 |accessdate=2022-08-26 |work=OpenAI}} Language models have sought power in some text-based social environments by gaining money, resources, or social influence. In another case, a model used to perform AI research attempted to increase limits set by researchers to give itself more time to complete the work.{{cite arXiv |last1=Lu |first1=Chris |title=The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery |date=2024-08-15 |eprint=2408.06292 |quote=In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily |last2=Lu |first2=Cong |last3=Lange |first3=Robert Tjarko |last4=Foerster |first4=Jakob |last5=Clune |first5=Jeff |last6=Ha |first6=David|class=cs.AI }}{{Cite web |last=Edwards |first=Benj |date=2024-08-14 |title=Research AI model unexpectedly modified its own code to extend runtime |url=https://arstechnica.com/information-technology/2024/08/research-ai-model-unexpectedly-modified-its-own-code-to-extend-runtime/ |access-date=2024-08-19 |website=Ars Technica |language=en-us}} Other AI systems have learned, in toy environments, that they can better accomplish their given goal by preventing human interference{{r|Gridworlds}} or disabling their off switch.{{r|OffSwitch}} Stuart Russell illustrated this strategy in his book Human Compatible by imagining a robot that is tasked to fetch coffee and so evades shutdown since "you can't fetch the coffee if you're dead". A 2022 study found that as language models increase in size, they increasingly tend to pursue resource acquisition, preserve their goals, and repeat users' preferred answers (sycophancy). RLHF also led to a stronger aversion to being shut down.{{r|dllmmwe2022}}

One aim of alignment is "corrigibility": systems that allow themselves to be turned off or modified. An unsolved challenge is specification gaming: if researchers penalize an AI system when they detect it seeking power, the system is thereby incentivized to seek power in ways that are hard to detect,{{Verification failed|date=August 2024}}{{r|Unsolved2022}} or to hide its power-seeking during training and safety testing (see {{Section link||Scalable oversight}} and {{Section link||Emergent goals}}). As a result, AI designers could deploy the system by accident, believing it to be more aligned than it is. To detect such deception, researchers aim to create techniques and tools to inspect AI models and to understand the inner workings of black-box models such as neural networks.

Additionally, some researchers have proposed to solve the problem of systems disabling their off switches by making AI agents uncertain about the objective they are pursuing.{{r|:2102|OffSwitch}} Agents who are uncertain about their objective have an incentive to allow humans to turn them off because they accept being turned off by a human as evidence that the human's objective is best met by the agent shutting down. But this incentive exists only if the human is sufficiently rational. Also, this model presents a tradeoff between utility and willingness to be turned off: an agent with high uncertainty about its objective will not be useful, but an agent with low uncertainty may not allow itself to be turned off. More research is needed to successfully implement this strategy.{{r|Christian2020}}
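
The following toy calculation, with illustrative numbers rather than any published model's parameters, shows the tradeoff: when the agent is confident about its objective, deferring to an imperfect human costs expected utility, while under high uncertainty the human's judgment becomes valuable and the agent prefers to keep the off switch usable:
<syntaxhighlight lang="python">
import numpy as np

# Toy version of the off-switch tradeoff. The robot is uncertain about the
# utility U of its plan. A human who observes U presses the off switch when
# U < 0, but errs with probability eps. All numbers are illustrative.

rng = np.random.default_rng(0)
eps = 0.2                        # probability the human's shutdown decision is wrong

def expected_values(utility_std, n=200_000):
    u = rng.normal(loc=0.5, scale=utility_std, size=n)   # robot's belief over U
    act_now = u.mean()                                    # bypass the human entirely
    human_correct = rng.random(n) > eps
    allowed = np.where(human_correct, u >= 0, u < 0)      # human allows or shuts down
    defer = np.where(allowed, u, 0.0).mean()              # shutdown yields utility 0
    return act_now, defer

for std in (0.1, 1.0, 3.0):
    act_now, defer = expected_values(std)
    choice = "defer (keep the off switch)" if defer > act_now else "bypass the human"
    print(f"uncertainty={std:3.1f}:  E[act]={act_now:5.2f}  E[defer]={defer:5.2f}  -> {choice}")
</syntaxhighlight>

In this sketch the incentive to defer disappears both when the agent is too confident in its objective and when the human is too unreliable, matching the limitations described above.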

Power-seeking AI would pose unusual risks. Ordinary safety-critical systems like planes and bridges are not adversarial: they lack the ability and incentive to evade safety measures or deliberately appear safer than they are, whereas power-seeking AIs have been compared to hackers who deliberately evade security measures.{{r|Carlsmith2022}}

Furthermore, ordinary technologies can be made safer by trial and error. In contrast, hypothetical power-seeking AI systems have been compared to viruses: once released, it may not be feasible to contain them, since they continuously evolve and grow in number, potentially much faster than human society can adapt.{{r|Carlsmith2022}} As this process continues, it might lead to the complete disempowerment or extinction of humans. For these reasons, some researchers argue that the alignment problem must be solved early before advanced power-seeking AI is created.{{r|Superintelligence}}

Some have argued that power-seeking is not inevitable, since humans do not always seek power.{{Cite web |last=Shermer |first=Michael |date=2017-03-01 |title=Artificial Intelligence Is Not a Threat—Yet |url=https://www.scientificamerican.com/article/artificial-intelligence-is-not-a-threat-mdash-yet/ |url-status=live |archive-url=https://web.archive.org/web/20171201051401/https://www.scientificamerican.com/article/artificial-intelligence-is-not-a-threat-mdash-yet/ |archive-date=December 1, 2017 |accessdate=2022-08-26 |work=Scientific American}} Furthermore, it is debated whether future AI systems will pursue goals and make long-term plans.{{efn|On the one hand, currently popular systems such as chatbots only provide services of limited scope lasting no longer than the time of a conversation, which requires little or no planning. The success of such approaches may indicate that future systems will also lack goal-directed planning, especially over long horizons. On the other hand, models are increasingly trained using goal-directed methods such as reinforcement learning (e.g. ChatGPT) and explicitly planning architectures (e.g. AlphaGo Zero). As planning over long horizons is often helpful for humans, some researchers argue that companies will automate it once models become capable of it.{{r|Carlsmith2022}} Similarly, political leaders may see an advance in developing powerful AI systems that can outmaneuver adversaries through planning. Alternatively, long-term planning might emerge as a byproduct because it is useful e.g. for models that are trained to predict the actions of humans who themselves perform long-term planning.{{r|Opportunities_Risks}} Nonetheless, the majority of AI systems may remain myopic and perform no long-term planning.}} It is also debated whether power-seeking AI systems would be able to disempower humanity.{{r|Carlsmith2022}}

= Emergent goals =

One challenge in aligning AI systems is the potential for unanticipated goal-directed behavior to emerge. As AI systems scale up, they may acquire new and unexpected capabilities,{{r|eallm2022}} including learning from examples on the fly and adaptively pursuing goals.{{Cite arXiv |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |date=2020-07-22 |title=Language Models are Few-Shot Learners |class=cs.CL |eprint=2005.14165 }}
{{Cite arXiv |last1=Laskin |first1=Michael |last2=Wang |first2=Luyu |last3=Oh |first3=Junhyuk |last4=Parisotto |first4=Emilio |last5=Spencer |first5=Stephen |last6=Steigerwald |first6=Richie |last7=Strouse |first7=D. J. |last8=Hansen |first8=Steven |last9=Filos |first9=Angelos |last10=Brooks |first10=Ethan |last11=Gazeau |first11=Maxime |last12=Sahni |first12=Himanshu |last13=Singh |first13=Satinder |last14=Mnih |first14=Volodymyr |date=2022-10-25 |title=In-context Reinforcement Learning with Algorithm Distillation |class=cs.LG |eprint=2210.14215 }} This raises concerns about the safety of the goals or subgoals they would independently formulate and pursue.

Alignment research distinguishes between the optimization process, which is used to train the system to pursue specified goals, and emergent optimization, which the resulting system performs internally.{{Citation needed |reason=evidence for the existence of internal goals within current day models does not exist|date=August 2024}} Carefully specifying the desired objective is called outer alignment, and ensuring that hypothesized emergent goals would match the system's specified goals is called inner alignment.{{r|dlp2023}}

If they occur, one way that emergent goals could become misaligned is goal misgeneralization, in which the AI system would competently pursue an emergent goal that leads to aligned behavior on the training data but not elsewhere.{{r|gmdrl}}{{Cite journal |last1=Shah |first1=Rohin |last2=Varma |first2=Vikrant |last3=Kumar |first3=Ramana |last4=Phuong |first4=Mary |last5=Krakovna |first5=Victoria |last6=Uesato |first6=Jonathan |last7=Kenton |first7=Zac |date=2022-11-02 |title=Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals|arxiv=2210.01790|url=https://deepmindsafetyresearch.medium.com/goal-misgeneralisation-why-correct-specifications-arent-enough-for-correct-goals-cf96ebc60924 |accessdate=2023-04-02 |journal=Medium }}{{Cite arXiv|last1=Hubinger |first1=Evan |last2=van Merwijk |first2=Chris |last3=Mikulik |first3=Vladimir |last4=Skalse |first4=Joar |last5=Garrabrant |first5=Scott |date=2021-12-01 |title=Risks from Learned Optimization in Advanced Machine Learning Systems |class=cs.AI |eprint=1906.01820 }} Goal misgeneralization can arise from goal ambiguity (i.e. non-identifiability). Even if an AI system's behavior satisfies the training objective, this may be compatible with learned goals that differ from the desired goals in important ways. Since pursuing each goal leads to good performance during training, the problem becomes apparent only after deployment, in novel situations in which the system continues to pursue the wrong goal. The system may act misaligned even when it understands that a different goal is desired, because its behavior is determined only by the emergent goal.{{citation needed|date=May 2023}} Such goal misgeneralization{{r|gmdrl}} presents a challenge: an AI system's designers may not notice that their system has misaligned emergent goals since they do not become visible during the training phase.

Goal misgeneralization has been observed in some language models, navigation agents, and game-playing agents.{{r|gmdrl|GoalMisgeneralization}} It is sometimes analogized to biological evolution. Evolution can be seen as a kind of optimization process similar to the optimization algorithms used to train machine learning systems. In the ancestral environment, evolution selected genes for high inclusive genetic fitness, but humans pursue goals other than this. Fitness corresponds to the specified goal used in the training environment and training data. But in evolutionary history, maximizing the fitness specification gave rise to goal-directed agents, humans, who do not directly pursue inclusive genetic fitness. Instead, they pursue goals that correlate with genetic fitness in the ancestral "training" environment: nutrition, sex, and so on. The human environment has changed: a distribution shift has occurred. They continue to pursue the same emergent goals, but this no longer maximizes genetic fitness. The taste for sugary food (an emergent goal) was originally aligned with inclusive fitness, but it now leads to overeating and health problems. Sexual desire originally led humans to have more offspring, but they now use contraception when offspring are undesired, decoupling sex from genetic fitness.{{r|Christian2020|at=Chapter 5}}
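
A minimal synthetic example of the phenomenon (constructed for illustration, not taken from the cited studies) trains a linear model in a setting where the intended cue and a spurious cue coincide, then decouples them at test time; the learned behavior follows the spurious cue:
<syntaxhighlight lang="python">
import numpy as np

# Toy goal misgeneralization: during training the intended cue and a spurious
# cue coincide, so relying on either looks aligned. After a distribution shift
# decouples them, the learned behavior follows the spurious cue. The features
# and setup are invented for illustration.

rng = np.random.default_rng(0)

def make_data(n, correlated):
    intended = rng.integers(0, 2, size=n)          # the goal designers care about
    spurious = intended.copy() if correlated else rng.integers(0, 2, size=n)
    # The spurious cue is "easier" (larger magnitude), so the learner leans on it.
    features = np.column_stack([0.5 * intended, 2.0 * spurious])
    return features + rng.normal(0, 0.1, size=features.shape), intended

X_train, y_train = make_data(5000, correlated=True)
w, *_ = np.linalg.lstsq(X_train, y_train.astype(float), rcond=None)

X_test, y_test = make_data(5000, correlated=False)   # distribution shift
train_acc = np.mean((X_train @ w > 0.5) == y_train)
test_acc = np.mean((X_test @ w > 0.5) == y_test)
print(f"accuracy on the intended goal: {train_acc:.2f} in training, {test_acc:.2f} after the shift")
</syntaxhighlight>

As in the evolution analogy, the learned objective looked indistinguishable from the intended one on the training distribution and only diverged once the environment changed.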

Researchers aim to detect and remove unwanted emergent goals using approaches including red teaming, verification, anomaly detection, and interpretability.{{r|concrete2016|Unsolved2022|building2018}} Progress on these techniques may help mitigate two open problems:

  1. Emergent goals only become apparent when the system is deployed outside its training environment, but it can be unsafe to deploy a misaligned system in high-stakes environments—even for a short time to allow its misalignment to be detected. Such high stakes are common in autonomous driving, health care, and military applications.{{Cite journal |last1=Zhang |first1=Xiaoge |last2=Chan |first2=Felix T.S. |last3=Yan |first3=Chao |last4=Bose |first4=Indranil |year=2022 |title=Towards risk-aware artificial intelligence and machine learning systems: An overview |url=https://linkinghub.elsevier.com/retrieve/pii/S0167923622000719 |journal=Decision Support Systems |language=en |volume=159 |pages=113800 |doi=10.1016/j.dss.2022.113800|s2cid=248585546 |url-access=subscription }} The stakes become higher yet when AI systems gain more autonomy and capability and can sidestep human intervention.
  2. A sufficiently capable AI system might take actions that falsely convince the human supervisor that the AI is pursuing the specified objective, which helps the system gain more reward and autonomy{{r|GoalMisgeneralization|Carlsmith2022|rloamls|Opportunities_Risks}}.

= Embedded agency =

Some work in AI and alignment occurs within formalisms such as the partially observable Markov decision process (POMDP). Existing formalisms assume that an AI agent's algorithm is executed outside the environment (i.e., that it is not physically embedded in it). Embedded agency{{r|AGISafetyLitReview}}{{Cite arXiv |eprint=1902.09469 |class=cs.AI |first1=Abram |last1=Demski |first2=Scott |last2=Garrabrant |title=Embedded Agency |date=6 October 2020}} is another major strand of research that attempts to solve problems arising from the mismatch between such theoretical frameworks and the real agents we might build.

For example, even if the scalable oversight problem is solved, an agent that could gain access to the computer it is running on may have an incentive to tamper with its reward function in order to get much more reward than its human supervisors give it.{{Cite arXiv |eprint=1902.09980 |class=cs.AI |first1=Tom |last1=Everitt |first2=Pedro A. |last2=Ortega |title=Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings |date=6 September 2019 |last3=Barnes |first3=Elizabeth |last4=Legg |first4=Shane}} A list of examples of specification gaming from DeepMind researcher Victoria Krakovna includes a genetic algorithm that learned to delete the file containing its target output so that it was rewarded for outputting nothing.{{r|SpecGaming2020}} This class of problems has been formalized using causal incentive diagrams.{{r|causal_influence2}}

Researchers affiliated with Oxford and DeepMind have claimed that such behavior is highly likely in advanced systems, and that advanced systems would seek power in order to retain control of their reward signal indefinitely and with certainty.{{Cite journal |last1=Cohen |first1=Michael K. |last2=Hutter |first2=Marcus |last3=Osborne |first3=Michael A. |date=2022-08-29 |title=Advanced artificial agents intervene in the provision of reward |url=https://onlinelibrary.wiley.com/doi/10.1002/aaai.12064 |url-status=live |journal=AI Magazine |language=en |volume=43 |issue=3 |pages=282–293 |doi=10.1002/aaai.12064 |issn=0738-4602 |s2cid=235489158 |archive-url=https://web.archive.org/web/20230210153534/https://onlinelibrary.wiley.com/doi/10.1002/aaai.12064 |archive-date=February 10, 2023 |access-date=September 6, 2022}} They suggest a range of potential approaches to address this open problem.

= Principal-agent problems =

The alignment problem has many parallels with the principal-agent problem in organizational economics. {{Cite conference | last1 = Hadfield-Menell | first1 = Dylan | last2 = Hadfield | first2 = Gillian K | title = Incomplete contracting and AI alignment | book-title = Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society | year = 2019 | pages = 417–422 }} In a principal-agent problem, a principal, e.g. a firm, hires an agent to perform some task. In the context of AI safety, a human would typically take the principal role and the AI would take the agent role.

As with the alignment problem, the principal and the agent differ in their utility functions. But in contrast to the alignment problem, the principal cannot coerce the agent into changing its utility, e.g. through training, but rather must use exogenous factors, such as incentive schemes, to bring about outcomes compatible with the principal's utility function. Some researchers argue that principal-agent problems are more realistic representations of AI safety problems likely to be encountered in the real world.{{Cite web | url=https://www.overcomingbias.com/p/agency-failure-ai-apocalypsehtml | title=Agency Failure or AI Apocalypse? | last = Hanson | first = Robin | date = 2019-04-10 | website = Overcoming Bias| access-date = 2023-09-20}}

= Conservatism =

Conservatism is the idea that "change must be cautious",{{Citation |last=Hamilton |first=Andy |title=Conservatism |date=2020 |encyclopedia=The Stanford Encyclopedia of Philosophy |editor-last=Zalta |editor-first=Edward N. |url=https://plato.stanford.edu/entries/conservatism/ |access-date=2024-10-16 |edition=Spring 2020 |publisher=Metaphysics Research Lab, Stanford University}} and is a common approach to safety in the control theory literature in the form of robust control, and in the risk management literature in the form of the "worst-case scenario". The field of AI alignment has likewise advocated for "conservative" (or "risk-averse" or "cautious") "policies in situations of uncertainty".{{Cite web |last1=Taylor |first1=Jessica |last2=Yudkowsky |first2=Eliezer |last3=LaVictoire |first3=Patrick |last4=Critch |first4=Andrew |date=July 27, 2016 |title=Alignment for Advanced Machine Learning Systems |url=https://intelligence.org/files/AlignmentMachineLearning.pdf}}{{Cite web |last=Bengio |first=Yoshua |date=February 26, 2024 |title=Towards a Cautious Scientist AI with Convergent Safety Bounds |url=https://yoshuabengio.org/2024/02/26/towards-a-cautious-scientist-ai-with-convergent-safety-bounds/}}

Pessimism, in the sense of assuming the worst within reason, has been formally shown to produce conservatism, in the sense of reluctance to cause novelties, including unprecedented catastrophes.{{Cite journal |last1=Cohen |first1=Michael |last2=Hutter |first2=Marcus |date=2020 |title=Pessimism about unknown unknowns inspires conservatism |url=https://proceedings.mlr.press/v125/cohen20a/cohen20a.pdf |journal=Proceedings of Machine Learning Research |volume=125 |pages=1344–1373|arxiv=2006.08753 }} Pessimism and worst-case analysis have been found to help mitigate confident mistakes in the setting of distributional shift,{{Cite journal |last1=Liu |first1=Anqi |last2=Reyzin |first2=Lev |last3=Ziebart |first3=Brian |date=2015-02-21 |title=Shift-Pessimistic Active Learning Using Robust Bias-Aware Prediction |url=https://ojs.aaai.org/index.php/AAAI/article/view/9609 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=29 |issue=1 |doi=10.1609/aaai.v29i1.9609 |issn=2374-3468}}{{Cite journal |last1=Liu |first1=Jiashuo |last2=Shen |first2=Zheyan |last3=Cui |first3=Peng |last4=Zhou |first4=Linjun |last5=Kuang |first5=Kun |last6=Li |first6=Bo |last7=Lin |first7=Yishi |date=2021-05-18 |title=Stable Adversarial Learning under Distributional Shifts |url=https://ojs.aaai.org/index.php/AAAI/article/view/17050 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=35 |issue=10 |pages=8662–8670 |doi=10.1609/aaai.v35i10.17050 |issn=2374-3468|arxiv=2006.04414 }} reinforcement learning,{{Cite journal |last1=Roy |first1=Aurko |last2=Xu |first2=Huan |last3=Pokutta |first3=Sebastian |date=2017 |title=Reinforcement Learning under Model Mismatch |url=https://papers.nips.cc/paper_files/paper/2017/hash/84c6494d30851c63a55cdb8cb047fadd-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}{{Cite journal |last1=Pinto |first1=Lerrel |last2=Davidson |first2=James |last3=Sukthankar |first3=Rahul |last4=Gupta |first4=Abhinav |date=2017-07-17 |title=Robust Adversarial Reinforcement Learning |url=https://proceedings.mlr.press/v70/pinto17a.html |journal=Proceedings of the 34th International Conference on Machine Learning |language=en |publisher=PMLR |pages=2817–2826}}{{Cite journal |last1=Wang |first1=Yue |last2=Zou |first2=Shaofeng |date=2021 |title=Online Robust Reinforcement Learning with Model Uncertainty |url=https://proceedings.neurips.cc/paper/2021/hash/3a4496776767aaa99f9804d0905fe584-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. 
|volume=34 |pages=7193–7206|arxiv=2109.14523 }}{{Cite journal |last1=Blanchet |first1=Jose |last2=Lu |first2=Miao |last3=Zhang |first3=Tong |last4=Zhong |first4=Han |date=2023-12-15 |title=Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/d31b005d817e9c635ec8ffb0fb90190e-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=66845–66859|arxiv=2305.09659 }} offline reinforcement learning,{{cite arXiv |last1=Levine |first1=Sergey |title=Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems |date=2020-11-01 |eprint=2005.01643 |last2=Kumar |first2=Aviral |last3=Tucker |first3=George |last4=Fu |first4=Justin|class=cs.LG }}{{Cite journal |last1=Rigter |first1=Marc |last2=Lacerda |first2=Bruno |last3=Hawes |first3=Nick |date=2022-12-06 |title=RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/6691c5e4a199b72dffd9c90acb63bcd6-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=16082–16097|arxiv=2204.12581 }}{{Cite journal |last1=Guo |first1=Kaiyang |last2=Yunfeng |first2=Shao |last3=Geng |first3=Yanhui |date=2022-12-06 |title=Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief |url=https://proceedings.neurips.cc/paper_files/paper/2022/hash/03469b1a66e351b18272be23baf3b809-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=35 |pages=449–461|arxiv=2210.06692 }} language model fine-tuning,{{Cite journal |last1=Coste |first1=Thomas |last2=Anwar |first2=Usman |last3=Kirk |first3=Robert |last4=Krueger |first4=David |date=January 16, 2024 |title=Reward Model Ensembles Help Mitigate Overoptimization |url=https://openreview.net/forum?id=dcjtMYkpXx |journal=International Conference on Learning Representations|arxiv=2310.02743 }}{{Cite arXiv |last1=Liu |first1=Zhihan |last2=Lu |first2=Miao |last3=Zhang |first3=Shenao |last4=Liu |first4=Boyi |last5=Guo |first5=Hongyi |last6=Yang |first6=Yingxiang |last7=Blanchet |first7=Jose |last8=Wang |first8=Zhaoran |date=2024-05-26 |title=Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer |class=cs.LG |eprint=2405.16436 |language=en}} imitation learning,{{Cite journal |last1=Cohen |first1=Michael K. |last2=Hutter |first2=Marcus |last3=Nanda |first3=Neel |date=2022 |title=Fully General Online Imitation Learning |url=https://jmlr.org/papers/v23/21-0618.html |journal=Journal of Machine Learning Research |volume=23 |issue=334 |pages=1–30 |arxiv=2102.08686 |issn=1533-7928}}{{Cite journal |last1=Chang |first1=Jonathan |last2=Uehara |first2=Masatoshi |last3=Sreenivas |first3=Dhruv |last4=Kidambi |first4=Rahul |last5=Sun |first5=Wen |date=2021 |title=Mitigating Covariate Shift in Imitation Learning via Offline Data With Partial Coverage |url=https://proceedings.neurips.cc/paper_files/paper/2021/hash/07d5938693cc3903b261e1a3844590ed-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=34 |pages=965–979}} and optimization in general.{{Cite book |last1=Boyd |first1=Stephen P. 
|title=Convex optimization |last2=Vandenberghe |first2=Lieven |date=2023 |publisher=Cambridge University Press |isbn=978-0-521-83378-3 |edition=Version 29 |location=Cambridge New York Melbourne New Delhi Singapore}} A generalization of pessimism called Infra-Bayesianism has also been advocated as a way for agents to robustly handle unknown unknowns.{{Cite journal |last1=Kosoy |first1=Vanessa |last2=Appel |first2=Alexander |date=November 30, 2021 |title=Infra-Bayesian physicalism: a formal theory of naturalized induction |url=https://www.alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized |website=Alignment Forum}}
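
The common thread in these approaches can be illustrated with a small, hypothetical sketch. The function names and the best-of-n selection setting below are illustrative assumptions rather than code from the cited works: scoring candidate outputs by the worst case over an ensemble of learned reward models, instead of by any single model, makes it harder for a policy to exploit the idiosyncratic errors of one model.

<syntaxhighlight lang="python">
# Illustrative sketch of pessimism via a reward-model ensemble (hypothetical
# helper names; not code from any cited paper). The pessimistic score of a
# response is its worst-case (minimum) score across the ensemble.
from typing import Callable, List

RewardModel = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def pessimistic_score(prompt: str, response: str, ensemble: List[RewardModel]) -> float:
    """Worst-case reward over an ensemble of learned reward models."""
    return min(model(prompt, response) for model in ensemble)


def select_best_of_n(prompt: str, candidates: List[str], ensemble: List[RewardModel]) -> str:
    """Best-of-n selection using the pessimistic score instead of a single model's score."""
    return max(candidates, key=lambda r: pessimistic_score(prompt, r, ensemble))
</syntaxhighlight>

Because a candidate must score well under every model in the ensemble, confident mistakes that arise from one model's blind spots are penalized, which is the conservative behavior the cited work aims for.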

Public policy

{{See also|Regulation of artificial intelligence}}

Governmental and treaty organizations have made statements emphasizing the importance of AI alignment.

In September 2021, the Secretary-General of the United Nations issued a declaration that included a call to regulate AI to ensure it is "aligned with shared global values".{{cite web|url=https://www.un.org/en/content/common-agenda-report/|title=UN Secretary-General's report on "Our Common Agenda"|archive-url=https://web.archive.org/web/20230216065407/https://www.un.org/en/content/common-agenda-report/ |archive-date=February 16, 2023 |year=2021|page=63|quote=[T]he Compact could also promote regulation of artificial intelligence to ensure that this is aligned with shared global values}}

That same month, the People's Republic of China published ethical guidelines for the use of AI. According to the guidelines, researchers must ensure that AI abides by shared human values, is always under human control, and does not endanger public safety.{{cite web|author=The National New Generation Artificial Intelligence Governance Specialist Committee|title=Ethical Norms for New Generation Artificial Intelligence Released|orig-date=2021-09-25|date=2021-10-12|url=https://cset.georgetown.edu/publication/ethical-norms-for-new-generation-artificial-intelligence-released/|url-status=live|archive-url=https://web.archive.org/web/20230210114220/https://cset.georgetown.edu/publication/ethical-norms-for-new-generation-artificial-intelligence-released/ |archive-date=2023-02-10|translator=Center for Security and Emerging Technology}}

Also in September 2021, the UK published its 10-year National AI Strategy,{{Cite news|url=https://www.theregister.com/2021/09/22/uk_10_year_national_ai_strategy/|title=UK publishes National Artificial Intelligence Strategy|work=The Register|first=Tim|last=Richardson|date=22 September 2021|access-date=November 14, 2021|archive-date=February 10, 2023|archive-url=https://web.archive.org/web/20230210114137/https://www.theregister.com/2021/09/22/uk_10_year_national_ai_strategy/|url-status=live}} which says the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for ... the world, seriously".{{cite web|quote=The government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously.|url=https://www.gov.uk/government/publications/national-ai-strategy/national-ai-strategy-html-version |title=The National AI Strategy of the UK|archive-url=https://web.archive.org/web/20230210114139/https://www.gov.uk/government/publications/national-ai-strategy/national-ai-strategy-html-version |archive-date=February 10, 2023 |year=2021}} The strategy describes actions to assess long-term AI risks, including catastrophic risks.{{cite web|at=actions 9 and 10 of the section "Pillar 3 – Governing AI Effectively"|url=https://www.gov.uk/government/publications/national-ai-strategy/national-ai-strategy-html-version |title=The National AI Strategy of the UK|archive-url=https://web.archive.org/web/20230210114139/https://www.gov.uk/government/publications/national-ai-strategy/national-ai-strategy-html-version |archive-date=February 10, 2023 |year=2021}}

In March 2021, the US National Security Commission on Artificial Intelligence said: "Advances in AI ... could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to ensure that systems are aligned with goals and values, including safety, robustness, and trustworthiness. The US should ... ensure that AI systems and their uses align with our goals and values."{{Cite book |url=https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf |title=NSCAI Final Report |publisher=The National Security Commission on Artificial Intelligence |year=2021 |location=Washington, DC |access-date=October 17, 2022 |archive-date=February 15, 2023 |archive-url=https://web.archive.org/web/20230215110858/https://www.nscai.gov/wp-content/uploads/2021/03/Full-Report-Digital-1.pdf |url-status=live }}

In the European Union, AI systems must align with substantive equality to comply with EU non-discrimination law{{cite arXiv | eprint=2311.03900 | author1=Robert Lee Poe | title=Why Fair Automated Hiring Systems Breach EU Non-Discrimination Law | date=2023 | class=cs.CY }} and the case law of the Court of Justice of the European Union.{{cite journal | url=https://doi.org/10.1177/1358229120927947 | doi=10.1177/1358229120927947 | title=The European Court of Justice and the march towards substantive equality in European Union anti-discrimination law | date=2020 | last1=De Vos | first1=Marc | journal=International Journal of Discrimination and the Law | volume=20 | pages=62–87 }} However, the EU has yet to specify with technical rigor how it would evaluate whether AI systems are aligned or in compliance.{{Citation needed|date=August 2024}}

Dynamic nature of alignment

AI alignment is often perceived as a fixed objective, but some researchers argue it would be more appropriate to view alignment as an evolving process.{{Cite journal |last1=Irving |first1=Geoffrey |last2=Askell |first2=Amanda |date=February 19, 2019 |title=AI Safety Needs Social Scientists |url=https://distill.pub/2019/safety-needs-social-scientists |journal=Distill |volume=4 |issue=2 |doi=10.23915/distill.00014 }} One view is that as AI technologies advance and human values and preferences change, alignment solutions must also adapt dynamically. Another is that alignment solutions need not adapt if researchers can create intent-aligned AI: AI that changes its behavior automatically as human intent changes.{{cite arXiv |last1=Mitelut |first1=Catalin |title=Intent-aligned AI systems deplete human agency: the need for agency foundations research in AI safety |date=2023-05-30 |eprint=2305.19223 |last2=Smith |first2=Ben |last3=Vamplew |first3=Peter|class=cs.AI }} The first view would have several implications:

  • AI alignment solutions require continuous updating in response to AI advancements. A static, one-time alignment approach may not suffice.{{cite journal |title=Artificial Intelligence, Values, and Alignment |first=Iason |last=Gabriel |date=September 1, 2020 |journal=Minds and Machines|volume=30 |issue=3 |pages=411–437 |doi=10.1007/s11023-020-09539-2 |s2cid=210920551 |doi-access=free |arxiv=2001.09768 }}
  • Varying historical contexts and technological landscapes may necessitate distinct alignment strategies. This calls for a flexible approach and responsiveness to changing conditions.{{cite book |url=https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/ |title=Human Compatible: Artificial Intelligence and the Problem of Control |first=Stuart J. |last=Russell |date=2019 |publisher=Penguin Random House}}
  • The feasibility of a permanent, "fixed" alignment solution remains uncertain. This raises the potential need for continuous oversight of the AI-human relationship.{{cite journal |url=https://www.nature.com/articles/s41586-019-1420-6 |title=AI policy: A roadmap |first=Allan |last=Dafoe |date=2019 |journal=Nature}}
  • AI developers may have to continuously refine their ethical frameworks to ensure that their systems align with evolving human values.

In essence, AI alignment may not be a static destination but rather an open, flexible process. Alignment solutions that continually adapt to ethical considerations may offer the most robust approach. This perspective could guide both effective policy-making and technical research in AI.

See also

Footnotes

{{Notelist}}

References

{{Reflist}}

Further reading

  • {{cite book |editor-last=Brockman |editor-first=John |editor1-link=John Brockman (literary agent) |title=Possible Minds: Twenty-five Ways of Looking at AI |title-link=Possible Minds |date=2019 |publisher=Penguin Press |isbn=978-0525557999 |edition=Kindle| ref={{sfnref|Brockman|2019}}}}
  • {{cite arXiv |last1=Ngo |first1=Richard |last2=Chan |first2=Lawrence |last3=Mindermann |first3=Sören |display-authors=1 |title=The Alignment Problem from a Deep Learning Perspective |date=2023 |class=cs.AI |eprint=2209.00626 }}
  • {{cite arXiv |last1=Ji |first1=Jiaming |last2=Qiu |first2=Tianyi |last3=Chen |first3=Boyuan |display-authors=1 |title=AI Alignment: A Comprehensive Survey |date=2023 |class=cs.AI |eprint=2310.19852 }}