MuZero

{{update|date=May 2022}}

{{short description|Game-playing artificial intelligence}}

{{Chess programming series}}

MuZero is a computer program developed by artificial intelligence research company DeepMind to master games without knowing their rules.{{cite web|last1=Wiggers |first1=Kyle |title=DeepMind's MuZero teaches itself how to win at Atari, chess, shogi, and Go |date=20 November 2019 |url=https://venturebeat.com/2019/11/20/deepminds-muzero-teaches-itself-how-to-win-at-atari-chess-shogi-and-go/ |access-date=22 July 2020 |publisher=VentureBeat}}{{cite news |last1=Friedel |first1=Frederic |title=MuZero figures out chess, rules and all |url=https://en.chessbase.com/post/muzero-figures-out-chess-rules-and-all |access-date=22 July 2020 |publisher=ChessBase GmbH}}{{cite news |last1=Rodriguez |first1=Jesus |title=DeepMind Unveils MuZero, a New Agent that Mastered Chess, Shogi, Atari and Go Without Knowing the Rules |url=https://www.kdnuggets.com/2019/12/deepmind-unveils-muzero-agent-chess-shogi-atari-go.html |access-date=22 July 2020 |work=KDnuggets}} Its release in 2019 included benchmarks of its performance in Go, chess, shogi, and a standard suite of Atari games. The algorithm uses an approach similar to that of AlphaZero. It matched AlphaZero's performance in chess and shogi, improved on its performance in Go (setting a new world record), and improved on the state of the art in mastering a suite of 57 Atari games (the Arcade Learning Environment), a visually complex domain.

MuZero was trained via self-play, with no access to rules, opening books, or endgame tablebases. The trained algorithm used the same convolutional and residual architecture as AlphaZero, but with 20 percent fewer computation steps per node in the search tree.{{cite journal|arxiv=1911.08265|first1=Julian|last1=Schrittwieser|first2=Ioannis|last2=Antonoglou|title=Mastering Atari, Go, chess and shogi by planning with a learned model|last3=Hubert|last9=Hassabis|first11=Timothy|last11=Lillicrap|first10=Thore|last10=Graepel|first9=Demis|first8=Edward|first3=Thomas|last8=Lockhart|last7=Guez|first6=Simon|last6=Schmitt|first5=Laurent|last5=Sifre|first4=Karen|last4=Simonyan|first7=Arthur|journal=Nature|year=2020|volume=588|issue=7839|pages=604–609|doi=10.1038/s41586-020-03051-4|pmid=33361790|bibcode=2020Natur.588..604S|s2cid=208158225}}

MuZero's ability to plan and learn effectively without being given explicit rules was regarded as a significant advance in reinforcement learning, extending such methods to domains whose dynamics must be learned rather than programmed.

{{Toclimit|3}}

History

{{Cquote

| quote = MuZero really is discovering for itself how to build a model and understand it just from first principles.

| author = David Silver, DeepMind

| source = Wired{{Cite magazine|title=What AlphaGo Can Teach Us About How People Learn|language=en-us|magazine=Wired|url=https://www.wired.com/story/what-alphago-teach-how-people-learn/|access-date=2020-12-25|issn=1059-1028}}

| float = right

}}

On November 19, 2019, the DeepMind team released a preprint introducing MuZero.

= Derivation from AlphaZero =

{{Further|AlphaZero}}MuZero (MZ) is a combination of the high-performance planning of the AlphaZero (AZ) algorithm with approaches to model-free reinforcement learning. The combination allows for more efficient training in classical planning regimes, such as Go, while also handling domains with much more complex inputs at each stage, such as visual video games.

MuZero was derived directly from AZ code, sharing its rules for setting hyperparameters. Differences between the approaches include:{{Cite arXiv |eprint=1712.01815 |class=cs.AI|first1=David|last1=Silver|first2=Thomas|last2=Hubert|author-link1=David Silver (programmer)|title=Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm|date=5 December 2017 |first5=Matthew |first11=Timothy |first3=Julian |last3=Schrittwieser |first4=Ioannis |author-link13=Demis Hassabis |last13=Hassabis |first13=Demis |last12=Simonyan |first12=Karen |last11=Lillicrap |last10=Graepel |last5=Lai |first10=Thore |author-link9=Dharshan Kumaran |last9=Kumaran |first9=Dharshan |last4=Antonoglou |first8=Laurent |last7=Lanctot |first7=Marc |last6=Guez |first6=Arthur |last8=Sifre}}

  • AZ's planning process uses a simulator that knows the rules of the game and must be explicitly programmed; a neural network then predicts the policy and value of a future position. This perfect knowledge of the game rules is used to model state transitions in the search tree, the actions available at each node, and the termination of a branch of the tree. MZ has no access to the rules and instead learns a model of them with its neural networks.
  • AZ has a single model for the game (from board state to predictions); MZ has separate models for representation of the current state (from board state into its internal embedding), dynamics of states (how actions change representations of board states), and prediction of policy and value of a future position (given a state's representation).
  • MZ's learned model may be complex, and it may turn out to be capable of hosting computation of its own; examining what the hidden model of a trained instance of MZ has actually learned remains a topic for future research.
  • MZ does not expect a two-player game where winners take all. It works with standard reinforcement-learning scenarios, including single-agent environments with continuous intermediate rewards, possibly of arbitrary magnitude and with time discounting. AZ was designed for two-player games that could be won, drawn, or lost.
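The three learned functions described above can be sketched in code. The names follow the paper's notation, with a representation function (h), a dynamics function (g), and a prediction function (f); the concrete layer shapes and random-weight "networks" below are illustrative placeholders, not DeepMind's actual architecture, which uses convolutional and residual networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """A random affine map followed by tanh, standing in for a trained network."""
    W = rng.standard_normal((out_dim, in_dim)) * 0.1
    b = np.zeros(out_dim)
    return lambda x: np.tanh(W @ x + b)

OBS_DIM, EMBED, ACTIONS = 16, 8, 4  # illustrative sizes only

# h: representation -- maps a raw observation to an internal embedding
h = linear(OBS_DIM, EMBED)
# g: dynamics -- maps (embedding, action) to the next embedding and a reward
g_state = linear(EMBED + ACTIONS, EMBED)
g_reward = linear(EMBED + ACTIONS, 1)
# f: prediction -- maps an embedding to policy logits and a value estimate
f_policy = linear(EMBED, ACTIONS)
f_value = linear(EMBED, 1)

def unroll(observation, action_sequence):
    """Unroll the learned model: embed the observation once with h, then
    repeatedly step the dynamics g for each planned action, reading out
    policy and value with f at every step. No game simulator is consulted."""
    s = h(observation)
    trajectory = []
    for a in action_sequence:
        x = np.concatenate([s, np.eye(ACTIONS)[a]])
        reward = g_reward(x)[0]
        s = g_state(x)
        trajectory.append((reward, f_policy(s), f_value(s)[0]))
    return trajectory

obs = rng.standard_normal(OBS_DIM)
steps = unroll(obs, [0, 2, 1])
```

In the full algorithm, a Monte Carlo tree search expands a tree of such hypothetical unrolls, so planning operates entirely inside the learned embedding space rather than on legal board states.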

= Comparison with R2D2 =

The previous state-of-the-art technique for learning to play the suite of Atari games was R2D2, the Recurrent Replay Distributed DQN.{{Cite conference |last1=Kapturowski |first1=Steven |last2=Ostrovski |first2=Georg |last3=Quan |first3=John |last4=Munos |first4=Remi |last5=Dabney |first5=Will |title=Recurrent Experience Replay in Distributed Reinforcement Learning |conference=ICLR 2019 |url=https://openreview.net/pdf?id=r1lyTjAqYX |via=Open Review}}

MuZero surpassed R2D2's performance across the suite of games in both mean and median score, though it did not do better in every individual game.

Training and results

For board games, MuZero used 16 third-generation tensor processing units (TPUs) for training and 1,000 TPUs for self-play, with 800 simulations per step; for Atari games, it used 8 TPUs for training and 32 TPUs for self-play, with 50 simulations per step.

AlphaZero used 64 second-generation TPUs for training and 5,000 first-generation TPUs for self-play. As TPU design has improved (third-generation chips are individually twice as powerful as second-generation chips, with further advances in bandwidth and networking across chips in a pod), these are comparable training setups.

R2D2 was trained for five days through 2 million training steps.

= Initial results =

MuZero matched AlphaZero's performance in chess and shogi after roughly 1 million training steps. It matched AZ's performance in Go after 500,000 training steps and surpassed it by 1 million steps. It matched R2D2's mean and median performance across the Atari game suite after 500,000 training steps and surpassed it by 1 million steps, though it never performed well on six games in the suite.

See also

References

{{reflist}}