Imitation learning

{{Short description|Machine learning technique where agents learn from demonstrations}}

'''Imitation learning''' is a paradigm in reinforcement learning in which an agent learns to perform a task through supervised learning on expert demonstrations. It is also called '''learning from demonstration''' and '''apprenticeship learning'''.{{Cite book |last1=Russell |first1=Stuart J. |title=Artificial intelligence: a modern approach |last2=Norvig |first2=Peter |date=2021 |publisher=Pearson |isbn=978-0-13-461099-3 |edition=Fourth |series=Pearson series in artificial intelligence |location=Hoboken |chapter=22.6 Apprenticeship and Inverse Reinforcement Learning}}{{Cite book |last1=Sutton |first1=Richard S. |title=Reinforcement learning: an introduction |last2=Barto |first2=Andrew G. |date=2018 |publisher=The MIT Press |isbn=978-0-262-03924-6 |edition=Second |series=Adaptive computation and machine learning series |location=Cambridge, Massachusetts |pages=470}}{{Cite journal |last1=Hussein |first1=Ahmed |last2=Gaber |first2=Mohamed Medhat |last3=Elyan |first3=Eyad |last4=Jayne |first4=Chrisina |date=2017-04-06 |title=Imitation Learning: A Survey of Learning Methods |url=https://doi.org/10.1145/3054912 |journal=ACM Comput. Surv. |volume=50 |issue=2 |pages=21:1–21:35 |doi=10.1145/3054912 |hdl=10059/2298 |issn=0360-0300|hdl-access=free }}

It has been applied to underactuated robotics,{{Cite web |title=Ch. 21 - Imitation Learning |url=https://underactuated.mit.edu/imitation.html |access-date=2024-08-08 |website=underactuated.mit.edu}} self-driving cars,{{Cite journal |last=Pomerleau |first=Dean A. |date=1988 |title=ALVINN: An Autonomous Land Vehicle in a Neural Network |url=https://proceedings.neurips.cc/paper/1988/hash/812b4ba287f5ee0bc9d43bbf5bbe87fb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=1}}{{Cite arXiv |last1=Bojarski |first1=Mariusz |last2=Del Testa |first2=Davide |last3=Dworakowski |first3=Daniel |last4=Firner |first4=Bernhard |last5=Flepp |first5=Beat |last6=Goyal |first6=Prasoon |last7=Jackel |first7=Lawrence D. |last8=Monfort |first8=Mathew |last9=Muller |first9=Urs |date=2016-04-25 |title=End to End Learning for Self-Driving Cars |class=cs.CV |eprint=1604.07316v1 |language=en}}{{Cite journal |last1=Kiran |first1=B Ravi |last2=Sobh |first2=Ibrahim |last3=Talpaert |first3=Victor |last4=Mannion |first4=Patrick |last5=Sallab |first5=Ahmad A. Al |last6=Yogamani |first6=Senthil |last7=Perez |first7=Patrick |date=June 2022 |title=Deep Reinforcement Learning for Autonomous Driving: A Survey |url=https://ieeexplore.ieee.org/document/9351818 |journal=IEEE Transactions on Intelligent Transportation Systems |volume=23 |issue=6 |pages=4909–4926 |doi=10.1109/TITS.2021.3054625 |arxiv=2002.00444 |issn=1524-9050}} quadcopter navigation,{{Cite journal |last1=Giusti |first1=Alessandro |last2=Guzzi |first2=Jerome |last3=Ciresan |first3=Dan C. |last4=He |first4=Fang-Lin |last5=Rodriguez |first5=Juan P. |last6=Fontana |first6=Flavio |last7=Faessler |first7=Matthias |last8=Forster |first8=Christian |last9=Schmidhuber |first9=Jurgen |last10=Caro |first10=Gianni Di |last11=Scaramuzza |first11=Davide |last12=Gambardella |first12=Luca M. |date=July 2016 |title=A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots |url=https://ieeexplore.ieee.org/document/7358076 |journal=IEEE Robotics and Automation Letters |volume=1 |issue=2 |pages=661–667 |doi=10.1109/LRA.2015.2509024 |issn=2377-3766}} helicopter aerobatics,{{Cite web |title=Autonomous Helicopter: Stanford University AI Lab |url=http://heli.stanford.edu/ |access-date=2024-08-08 |website=heli.stanford.edu}} and locomotion.{{Cite journal |last1=Nakanishi |first1=Jun |last2=Morimoto |first2=Jun |last3=Endo |first3=Gen |last4=Cheng |first4=Gordon |last5=Schaal |first5=Stefan |last6=Kawato |first6=Mitsuo |date=June 2004 |title=Learning from demonstration and adaptation of biped locomotion |url=https://linkinghub.elsevier.com/retrieve/pii/S0921889004000399 |journal=Robotics and Autonomous Systems |language=en |volume=47 |issue=2–3 |pages=79–91 |doi=10.1016/j.robot.2004.03.003}}{{Cite book |last1=Kalakrishnan |first1=Mrinal |last2=Buchli |first2=Jonas |last3=Pastor |first3=Peter |last4=Schaal |first4=Stefan |chapter=Learning locomotion over rough terrain using terrain templates |date=October 2009 |pages=167–172 |title=2009 IEEE/RSJ International Conference on Intelligent Robots and Systems |chapter-url=http://dx.doi.org/10.1109/iros.2009.5354701 |publisher=IEEE |doi=10.1109/iros.2009.5354701|isbn=978-1-4244-3803-7 }}

== Approaches ==

Expert demonstrations are recordings of an expert performing the desired task, often collected as observation–action pairs <math>(o_t^*, a_t^*)</math>.

=== Behavior cloning ===

Behavior cloning (BC) is the most basic form of imitation learning. It uses supervised learning to train a policy <math>\pi_\theta</math> such that, given an observation <math>o_t</math>, the policy outputs an action distribution <math>\pi_\theta(\cdot | o_t)</math> that approximately matches the expert's action distribution.[https://rail.eecs.berkeley.edu/deeprlcourse/deeprlcourse/static/slides/lec-2.pdf CS 285 at UC Berkeley: Deep Reinforcement Learning. Lecture 2: Supervised Learning of Behaviors]
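
In outline, behavior cloning reduces to an ordinary supervised-learning loop over the demonstration data. The following is a minimal sketch for a discrete action space; the network class and dataset tensors (<code>PolicyNet</code>, <code>expert_obs</code>, <code>expert_acts</code>) are illustrative placeholders, not part of any standard library.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical policy network mapping an observation to action logits."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def behavior_cloning(expert_obs, expert_acts, obs_dim, n_actions, epochs=100):
    """Fit a policy to expert observation-action pairs by maximum likelihood."""
    policy = PolicyNet(obs_dim, n_actions)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()          # negative log-likelihood of the expert's actions
    for _ in range(epochs):
        logits = policy(expert_obs)          # parameterizes pi_theta(. | o_t)
        loss = loss_fn(logits, expert_acts)  # match the expert's action choices
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
</syntaxhighlight>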

BC is susceptible to distribution shift. If the trained policy differs even slightly from the expert policy, its errors compound over a rollout, and it can stray from the expert trajectories into observations that never occur in the expert demonstrations and for which it has received no training signal.

This problem was already noted in ALVINN, an early project that trained a neural network to drive a van from human demonstrations. The authors observed that because a human driver never strays far from the path, the network would never be trained on what action to take if the vehicle ever found itself far off the path.

=== DAgger ===

DAgger (Dataset Aggregation){{Cite journal |last1=Ross |first1=Stephane |last2=Gordon |first2=Geoffrey |last3=Bagnell |first3=Drew |date=2011-06-14 |title=A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning |url=https://proceedings.mlr.press/v15/ross11a.html |journal=Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics |language=en |publisher=JMLR Workshop and Conference Proceedings |pages=627–635}} improves on behavior cloning by iteratively aggregating expert labels on the observations that the learned policy actually visits. In each iteration, the algorithm first collects data by rolling out the learned policy <math>\pi_\theta</math>. Then, it queries the expert for the optimal action <math>a_t^*</math> on each observation <math>o_t</math> encountered during the rollout. Finally, it aggregates the new data into the dataset
:<math>D \leftarrow D \cup \{ (o_1, a_1^*), (o_2, a_2^*), \dots, (o_T, a_T^*) \}</math>
and trains a new policy on the aggregated dataset.
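
The loop described above can be sketched as follows. This is an illustration under several assumptions: a Gym-style environment with <code>reset()</code> and <code>step()</code>, an <code>expert_action</code> oracle that can be queried on arbitrary observations, and a <code>train_policy</code> routine implementing a supervised step such as behavior cloning; all three names are placeholders.

<syntaxhighlight lang="python">
def dagger(env, expert_action, train_policy, n_iterations=10, horizon=100):
    """Sketch of DAgger: roll out the current policy, label the visited
    observations with expert actions, aggregate, and retrain."""
    dataset = []          # aggregated dataset D of (observation, expert action) pairs
    policy = None         # on the first iteration, roll out the expert itself
    for _ in range(n_iterations):
        obs = env.reset()
        for _ in range(horizon):
            # Act with the learned policy (or the expert before any training).
            action = expert_action(obs) if policy is None else policy(obs)
            # Query the expert for the optimal action at this observation.
            dataset.append((obs, expert_action(obs)))
            obs, _reward, done, _info = env.step(action)
            if done:
                break
        # Train a new policy on the aggregated dataset.
        policy = train_policy(dataset)
    return policy
</syntaxhighlight>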

=== Decision transformer ===

[[File:Decision_Transformer_architecture.png|thumb|Decision Transformer architecture]]

The Decision Transformer approach models reinforcement learning as a sequence modeling problem.{{Cite journal |last1=Chen |first1=Lili |last2=Lu |first2=Kevin |last3=Rajeswaran |first3=Aravind |last4=Lee |first4=Kimin |last5=Grover |first5=Aditya |last6=Laskin |first6=Misha |last7=Abbeel |first7=Pieter |last8=Srinivas |first8=Aravind |last9=Mordatch |first9=Igor |date=2021 |title=Decision Transformer: Reinforcement Learning via Sequence Modeling |url=https://proceedings.neurips.cc/paper/2021/hash/7f489f642a0ddb10272b5c31057f0663-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=34 |pages=15084–15097|arxiv=2106.01345 }} Similar to behavior cloning, it trains a sequence model, such as a Transformer, on rollout sequences <math>(R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_T, o_T, a_T)</math>, where <math>R_t = r_t + r_{t+1} + \dots + r_T</math> is the return-to-go, i.e. the sum of future rewards in the rollout. At training time, the sequence model is trained to predict each action <math>a_t</math> given the preceding rollout as context:
:<math>(R_1, o_1, a_1), (R_2, o_2, a_2), \dots, (R_t, o_t)</math>
At inference time, to use the sequence model as a controller, it is simply conditioned on a very high target return <math>R</math>, and it generalizes by predicting actions that would lead to that high return. This approach was shown to scale predictably to a Transformer with 1 billion parameters that is superhuman on 41 Atari games.{{Citation |last1=Lee |first1=Kuang-Huei |title=Multi-Game Decision Transformers |date=2022-10-15 |url=https://arxiv.org/abs/2205.15241 |access-date=2024-10-22 |arxiv=2205.15241 |last2=Nachum |first2=Ofir |last3=Yang |first3=Mengjiao |last4=Lee |first4=Lisa |last5=Freeman |first5=Daniel |last6=Xu |first6=Winnie |last7=Guadarrama |first7=Sergio |last8=Fischer |first8=Ian |last9=Jang |first9=Eric}}
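
The return-to-go conditioning can be illustrated with the sketch below, which computes <math>R_t</math> for a recorded rollout and then queries a trained sequence model with a high target return at inference time. The environment interface and the <code>predict_action</code> method are hypothetical placeholders rather than the published implementation.

<syntaxhighlight lang="python">
def returns_to_go(rewards):
    """Compute R_t = r_t + r_{t+1} + ... + r_T for every step of a rollout."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

def decision_transformer_rollout(model, env, target_return, horizon=100):
    """Condition a trained sequence model on a high target return and act."""
    obs = env.reset()
    context = []          # past (return-to-go, observation, action) triples
    rtg = target_return   # the "very high reward" the model is asked to achieve
    for _ in range(horizon):
        # The model predicts the next action from the prefix ending in (rtg, obs).
        action = model.predict_action(context, rtg, obs)
        next_obs, reward, done, _info = env.step(action)
        context.append((rtg, obs, action))
        rtg -= reward     # the remaining target return shrinks as reward is collected
        obs = next_obs
        if done:
            break
    return context
</syntaxhighlight>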

=== Other approaches ===

Other approaches include deep Q-learning from demonstrations{{Cite arXiv |last1=Hester |first1=Todd |last2=Vecerik |first2=Matej |last3=Pietquin |first3=Olivier |last4=Lanctot |first4=Marc |last5=Schaul |first5=Tom |last6=Piot |first6=Bilal |last7=Horgan |first7=Dan |last8=Quan |first8=John |last9=Sendonaris |first9=Andrew |date=2017-04-12 |title=Deep Q-learning from Demonstrations |class=cs.AI |eprint=1704.03732v4 |language=en}} and one-shot imitation learning.{{Cite journal |last1=Duan |first1=Yan |last2=Andrychowicz |first2=Marcin |last3=Stadie |first3=Bradly |last4=Ho |first4=Jonathan |last5=Schneider |first5=Jonas |last6=Sutskever |first6=Ilya |last7=Abbeel |first7=Pieter |last8=Zaremba |first8=Wojciech |date=2017 |title=One-Shot Imitation Learning |url=https://proceedings.neurips.cc/paper_files/paper/2017/hash/ba3866600c3540f67c1e9575e213be0a-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}

== Related approaches ==

Inverse reinforcement learning (IRL) learns a reward function that explains the expert's behavior and then uses reinforcement learning to find a policy that maximizes this reward.{{Cite journal |last1=Ng |first1=Andrew Y. |last2=Russell |first2=Stuart J. |date=2000 |title=Algorithms for Inverse Reinforcement Learning |url=https://cir.nii.ac.jp/crid/1570854174323564160 |journal=Proceedings of the 17th International Conference on Machine Learning |pages=663–670}} Recent work has also explored multi-agent extensions of IRL in networked systems.{{Cite journal |last1=Donge |first1=V. S. |last2=Lian |first2=B. |last3=Lewis |first3=F. L. |last4=Davoudi |first4=A. |date=June 2023 |title=Multiagent Graphical Games With Inverse Reinforcement Learning |journal=IEEE Transactions on Control of Network Systems |volume=10 |issue=2 |pages=841–852 |doi=10.1109/TCNS.2022.3210856}}
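
As a rough sketch of this recipe (not the specific algorithm of the cited papers), a feature-matching variant of IRL alternates between fitting a linear reward and solving the induced reinforcement-learning problem; <code>feature_fn</code>, <code>expert_trajs</code>, and <code>rl_solver</code> below are placeholders.

<syntaxhighlight lang="python">
import numpy as np

def feature_matching_irl(feature_fn, expert_trajs, rl_solver, n_iters=20, lr=0.1):
    """Adjust a linear reward w . phi(o, a) until the learned policy's
    feature expectations match the expert's (illustrative sketch)."""
    # Empirical feature expectations of the expert demonstrations.
    mu_expert = np.mean([feature_fn(o, a) for traj in expert_trajs for o, a in traj], axis=0)
    w = np.zeros_like(mu_expert)
    for _ in range(n_iters):
        # Solve the RL problem for the current reward; rl_solver stands in for
        # any RL algorithm that returns trajectories of its (near-)optimal policy.
        policy_trajs = rl_solver(reward=lambda o, a: w @ feature_fn(o, a))
        mu_policy = np.mean([feature_fn(o, a) for traj in policy_trajs for o, a in traj], axis=0)
        # Push the reward weights toward features the expert exhibits more often.
        w += lr * (mu_expert - mu_policy)
    return w
</syntaxhighlight>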

Generative Adversarial Imitation Learning (GAIL) uses generative adversarial networks (GANs) to match the distribution of agent behavior to the distribution of expert demonstrations.{{Cite journal |last1=Ho |first1=Jonathan |last2=Ermon |first2=Stefano |date=2016 |title=Generative Adversarial Imitation Learning |url=https://proceedings.neurips.cc/paper_files/paper/2016/hash/cc7e2b878868cbae992d1fb743995d8f-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=29|arxiv=1606.03476 }} It extends a previous approach using game theory.{{Cite journal |last1=Syed |first1=Umar |last2=Schapire |first2=Robert E |date=2007 |title=A Game-Theoretic Approach to Apprenticeship Learning |url=https://proceedings.neurips.cc/paper_files/paper/2007/hash/ca3ec598002d2e7662e2ef4bdd58278b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=20}}
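
A heavily simplified sketch of one GAIL-style update is given below: a discriminator is trained to separate expert transitions from policy transitions, and its output is converted into a surrogate reward for a reinforcement-learning update of the policy. The batch format, discriminator architecture, and <code>rl_update</code> routine are placeholders (the original paper uses TRPO for the policy step).

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

def gail_step(policy, discriminator, expert_batch, policy_batch, disc_opt, rl_update):
    """One adversarial imitation update (sketch): train the discriminator to
    tell expert data from policy data, then reward the policy for fooling it."""
    bce = nn.BCEWithLogitsLoss()
    # 1. Discriminator update: expert transitions labelled 1, policy transitions 0.
    expert_logits = discriminator(expert_batch)
    policy_logits = discriminator(policy_batch)
    disc_loss = (bce(expert_logits, torch.ones_like(expert_logits))
                 + bce(policy_logits, torch.zeros_like(policy_logits)))
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()
    # 2. Policy update: -log(1 - D) rewards transitions the discriminator
    #    mistakes for expert data; any policy-gradient method can consume it.
    with torch.no_grad():
        d = torch.sigmoid(discriminator(policy_batch))
        surrogate_reward = -torch.log(1.0 - d + 1e-8)
    rl_update(policy, policy_batch, surrogate_reward)
</syntaxhighlight>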

== Further reading ==

* {{Cite journal |last1=Hussein |first1=Ahmed |last2=Gaber |first2=Mohamed Medhat |last3=Elyan |first3=Eyad |last4=Jayne |first4=Chrisina |date=2018-03-31 |title=Imitation Learning: A Survey of Learning Methods |url=https://dl.acm.org/doi/10.1145/3054912 |journal=ACM Computing Surveys |language=en |volume=50 |issue=2 |pages=1–35 |doi=10.1145/3054912 |hdl=10059/2298 |issn=0360-0300|hdl-access=free }}

== References ==

{{Artificial intelligence navbox}}

[[Category:Supervised learning]]