Proximal policy optimization

{{Short description|Model-free reinforcement learning algorithm}}

{{Machine learning|Reinforcement learning}}

{{More citations needed|date=October 2022|reason=Developments may have been published since 2017}}

Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large.

History

The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO enforces the trust region with the Hessian matrix (a matrix of second derivatives), which is computationally inefficient for large-scale problems.{{Cite journal |last1=Schulman |first1=John |last2=Levine |first2=Sergey |last3=Moritz |first3=Philipp |last4=Jordan |first4=Michael |last5=Abbeel |first5=Pieter |date=2015-07-06 |title=Trust region policy optimization |url=https://dl.acm.org/doi/10.5555/3045118.3045319 |journal=Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 |series=ICML'15 |location=Lille, France |publisher=JMLR.org |pages=1889–1897 }}

PPO was published in 2017. It is essentially an approximation of TRPO that does not require computing the Hessian: the KL divergence constraint is approximated by simply clipping the probability ratio between the new and old policies in the objective.{{Citation |last1=Schulman |first1=John |title=Proximal Policy Optimization Algorithms |date=2017-08-28 |arxiv=1707.06347 |last2=Wolski |first2=Filip |last3=Dhariwal |first3=Prafulla |last4=Radford |first4=Alec |last5=Klimov |first5=Oleg}}

Since 2018, PPO has been the default RL algorithm at OpenAI.OpenAI, "Proximal Policy Optimization". Available at: https://openai.com/research/openai-baselines-ppo PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games.Arxiv Insights. "An introduction to Policy Gradient methods," YouTube, Oct 1st, 2018 [Video file]. Available: https://www.youtube.com/watch?v=5P7I-xPq8u8

TRPO

TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces.

The pseudocode is as follows:{{Cite web |title=Trust Region Policy Optimization — Spinning Up documentation |url=https://spinningup.openai.com/en/latest/algorithms/trpo.html |access-date=2025-01-21 |website=spinningup.openai.com}}

  • Input: initial policy parameters \theta_0, initial value function parameters \phi_0
  • Hyperparameters: KL-divergence limit \delta, backtracking coefficient \alpha, maximum number of backtracking steps K
  • for k=0,1,2, \ldots do
  • Collect set of trajectories \mathcal{D}_k=\left\{\tau_i\right\} by running policy \pi_k=\pi\left(\theta_k\right) in the environment.
  • Compute rewards-to-go \hat{R}_t, the sum of rewards from time t to the end of each trajectory.
  • Compute advantage estimates, \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  • Estimate policy gradient as

\hat{g}_k=\left.\frac{1}{\left|\mathcal{D}_k\right|} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)\right|_{\theta_k} \hat{A}_t

  • Use the conjugate gradient algorithm to compute

\hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k

where \hat{H}_k is the Hessian of the sample average KL-divergence.

  • Update the policy by backtracking line search with

\theta_{k+1}=\theta_k+\alpha^j \sqrt{\frac{2 \delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}} \hat{x}_k

where j \in\{0,1,2, \ldots K\} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint.

  • Fit value function by regression on mean-squared error:

\phi_{k+1}=\arg \min _\phi \frac{1}{\left|\mathcal{D}_k\right| T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T\left(V_\phi\left(s_t\right)-\hat{R}_t\right)^2

typically via some gradient descent algorithm.
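To make the TRPO update concrete, the following is a minimal, illustrative Python sketch of the conjugate gradient solve for \hat{x}_k \approx \hat{H}_k^{-1} \hat{g}_k and the backtracking line search. The functions hvp (a Hessian-vector product of the sample-average KL divergence), surrogate_loss, and mean_kl are hypothetical stand-ins for quantities estimated from the collected trajectories; this is a sketch under those assumptions, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products (no explicit Hessian)."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x, with x = 0 initially
    p = r.copy()
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p.dot(Hp) + 1e-8)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta, g, hvp, surrogate_loss, mean_kl,
              delta=0.01, backtrack_coeff=0.8, max_backtracks=10):
    """One TRPO policy update: natural-gradient direction plus backtracking line search."""
    x = conjugate_gradient(hvp, g)                              # x ≈ H^{-1} g
    step = np.sqrt(2 * delta / (x.dot(hvp(x)) + 1e-8)) * x      # largest step inside the trust region
    old_loss = surrogate_loss(theta)
    for j in range(max_backtracks + 1):
        theta_new = theta + (backtrack_coeff ** j) * step
        # Accept the first step that improves the surrogate and satisfies the KL constraint.
        if surrogate_loss(theta_new) > old_loss and mean_kl(theta_new) <= delta:
            return theta_new
    return theta                                                # otherwise keep the old parameters
</syntaxhighlight>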

PPO

The pseudocode is as follows:{{Cite web |title=Proximal Policy Optimization — Spinning Up documentation |url=https://spinningup.openai.com/en/latest/algorithms/ppo.html |access-date=2025-01-21 |website=spinningup.openai.com}}

  • Input: initial policy parameters \theta_0, initial value function parameters \phi_0
  • for k=0,1,2, \ldots do
  • Collect set of trajectories \mathcal{D}_k=\left\{\tau_i\right\} by running policy \pi_k=\pi\left(\theta_k\right) in the environment.
  • Compute rewards-to-go \hat{R}_t.
  • Compute advantage estimates, \hat{A}_t (using any method of advantage estimation) based on the current value function V_{\phi_k}.
  • Update the policy by maximizing the PPO-Clip objective:

\theta_{k+1}=\arg \max _\theta \frac{1}{\left|\mathcal{D}_k\right| T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T \min \left(\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_k}\left(a_t \mid s_t\right)} A^{\pi_{\theta_k}}\left(s_t, a_t\right), \quad g\left(\epsilon, A^{\pi_{\theta_k}}\left(s_t, a_t\right)\right)\right)

where g(\epsilon, A)=(1+\epsilon) A if A \geq 0 and (1-\epsilon) A if A<0. The maximization is typically performed via stochastic gradient ascent with Adam.

  • Fit value function by regression on mean-squared error:

\phi_{k+1}=\arg \min _\phi \frac{1}{\left|\mathcal{D}_k\right| T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^T\left(V_\phi\left(s_t\right)-\hat{R}_t\right)^2

typically via some gradient descent algorithm.
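For illustration, a single PPO update on a batch of collected data might be sketched as follows in PyTorch. Here policy is assumed to return a torch.distributions object, value_fn is a separate value network, and states, actions, advantages, returns_to_go, and old_log_probs are tensors built from \mathcal{D}_k; this is a simplified sketch of the pseudocode above rather than a complete implementation.

<syntaxhighlight lang="python">
import torch

def ppo_update(policy, value_fn, policy_opt, value_opt,
               states, actions, advantages, returns_to_go, old_log_probs,
               clip_eps=0.2, epochs=10):
    for _ in range(epochs):
        # Probability ratio r_t(theta) between the current policy and the data-collecting policy.
        dist = policy(states)                         # assumed to return a torch.distributions object
        log_probs = dist.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)

        # PPO-Clip surrogate objective (negated, because the optimizer minimizes).
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()
        policy_opt.zero_grad()
        policy_loss.backward()
        policy_opt.step()

        # Fit the value function by mean-squared-error regression on the rewards-to-go.
        value_loss = ((value_fn(states).squeeze(-1) - returns_to_go) ** 2).mean()
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()
</syntaxhighlight>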

Like other policy gradient methods, PPO trains an RL agent whose actions are determined by a differentiable policy function, and updates the policy parameters by gradient ascent.

Intuitively, a policy gradient method takes small policy update steps, so that the agent's expected reward improves steadily.

Policy gradient methods may be unstable: a step size that is too large can push the policy in a poor direction from which it has little chance of recovering, while a step size that is too small lowers overall training efficiency.

To address this instability, PPO uses a clip function that prevents the policy update from becoming too large, so that larger step sizes can be used without destabilizing the gradient ascent process.T. Simonini, "Proximal Policy Optimization (PPO)," Hugging Face – The AI community building the future., https://huggingface.co/blog/deep-rl-ppo

= Basic concepts =

To begin the PPO training process, the agent is placed in an environment and performs actions based on its current observations. In the early phase of training, the agent can freely explore solutions and keep track of the results. Later, with a certain number of transition samples and policy updates, the agent selects an action by randomly sampling from the probability distribution P(A|S) generated by the policy network."A Beginner's Guide to deep Reinforcement learning," Pathmind. https://wiki.pathmind.com/deep-reinforcement-learning#reward The actions that are most likely to be beneficial have the highest probability of being selected from this random sample. After the agent arrives at a different scenario (a new state) by acting, it receives a positive or negative reward. The objective of the agent is to maximize the cumulative reward signal across sequences of states, known as episodes.
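As an illustrative sketch (the network architecture, observation size, and number of actions below are arbitrary assumptions), sampling an action from the distribution P(A|S) produced by a policy network over a discrete action space might look as follows:

<syntaxhighlight lang="python">
import torch
from torch.distributions import Categorical

# Hypothetical policy network mapping a 4-dimensional observation to logits over 2 actions.
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)

state = torch.randn(4)                        # stand-in for an observation from the environment
dist = Categorical(logits=policy_net(state))  # the distribution P(A|S)
action = dist.sample()                        # more probable actions are sampled more often
log_prob = dist.log_prob(action)              # stored for the later ratio computation
</syntaxhighlight>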

= Policy gradient laws: the advantage function =

The advantage function (denoted as A) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than the average action available in a given state. By definition, the advantage function is an estimate of the relative value of a selected action. If the output of this function is positive, the action in question is better than the average return, so the probability of selecting that specific action will increase. The opposite is true for a negative advantage output.

The advantage function can be defined as A=Q-V, where Q is the discounted sum of rewards (the total weighted reward for the completion of an episode) and V is the baseline estimate.OpenAI, "Part 1: Key concepts in RL," Part 1: Key Concepts in RL - Spinning Up documentation, https://spinningup.openai.com/en/latest/spinningup/rl_intro.html Since the advantage function is calculated only after the completion of an episode, the program records the actual outcome of the episode, so no externally provided labels are needed to compute it. The baseline estimate comes from the value function, which outputs the expected discounted return of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variance), since it also uses a neural network, like the policy function itself. With Q and V computed, the advantage is calculated by subtracting the baseline estimate from the actual discounted return.Rohitkumar, "PPO (Proximal Policy Optimization) explained with code examples in PyTorch and TensorFlow," PlainSwipe, https://plainswipe.com/ppo-proximal-policy-optimization-explained-with-code-examples-in-pytorch-and-tensorflow/ If A>0, the actual return of the action is better than the expected return from experience; if A<0, the actual return is worse.
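A minimal sketch of this calculation, assuming a recorded episode of rewards, a discount factor, and per-state baseline estimates from the value function (all inputs here are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def advantages_from_episode(rewards, values, gamma=0.99):
    """A = Q - V, with Q taken as the discounted sum of rewards from each step onward."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted return (Q) from step t
        returns[t] = running
    return returns - np.asarray(values)          # subtract the baseline estimate V

# A positive entry means the action did better than the baseline predicted.
adv = advantages_from_episode(rewards=[1.0, 0.0, 2.0], values=[1.5, 0.5, 1.0])
</syntaxhighlight>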

= Ratio function =

In PPO, the ratio function r_t(\theta)=\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text{old}}}\left(a_t \mid s_t\right)} is the probability of selecting action a_t in state s_t under the current policy network, divided by the probability of the same action under the old policy. In other words:

  • If r_t(\theta) > 1, where \theta are the policy network parameters, then selecting action a in state s is more likely based on the current policy than the previous policy.
  • If 0 \le r_t(\theta) < 1, then selecting action a in state s is less likely based on the current policy than the old policy.

Hence, this ratio function provides an easy way to measure how far the current policy has deviated from the old policy.W. Heeswijk, "Proximal Policy Optimization (PPO) explained," Medium, https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b
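In practice, the ratio is usually computed from stored log-probabilities rather than by dividing raw probabilities, which is numerically more stable; a small illustrative sketch (the tensor values are arbitrary):

<syntaxhighlight lang="python">
import torch

# log pi(a_t|s_t) under the current policy and under the policy that collected the data.
new_log_probs = torch.tensor([-0.9, -2.1, -0.3])
old_log_probs = torch.tensor([-1.0, -1.8, -0.3])

ratio = torch.exp(new_log_probs - old_log_probs)
# ratio > 1: the action became more likely under the current policy;
# ratio < 1: the action became less likely; ratio = 1: unchanged.
</syntaxhighlight>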

= PPO objective function =

The objective function of PPO takes an expectation (denoted as E), meaning that it is computed over timesteps sampled from batches of trajectories. Inside the expectation, it takes the minimum of two terms:

  1. r_t(\theta) \cdot A: this is the product of the ratio function and the advantage function introduced in TRPO, also known as the normal policy gradient objective.Edan Meyer. "Proximal Policy Optimization Explained," YouTube, May 20th, 2021 [Video file]. Available: https://www.youtube.com/watch?v=HrapVFNBN64
  2. \operatorname{clip}(r_t(\theta)) \cdot A: the policy ratio is first clipped to the range [1 - \epsilon, 1 + \epsilon]; generally, \epsilon is defined to be 0.2. Then, as before, it is multiplied by the advantage.

The fundamental intuition behind PPO is the same as that of TRPO: conservatism. Clipping results in a conservative estimate of how much the new policy benefits from a given advantage. The reasoning is that if an agent makes large changes based on high advantage estimates, its policy update will be large and unstable, and may diverge from the optimal policy with little possibility of recovery.{{cite web|author1=C. M. Wild|title=The Pursuit of (Robotic) Happiness: How TRPO and PPO Stabilize Policy Gradient Methods|url=https://towardsdatascience.com/the-pursuit-of-robotic-happiness-how-trpoand-ppo-stabilize-policy-gradient-methods-545784094e3b|website=towardsdatascience.com|date=Jul 9, 2018}} The clipping acts in both directions: when an action under the new policy turns out to be a good choice according to the advantage function, the clip limits how much credit the new policy can receive for up-weighting that action; conversely, when an action is judged to be bad, the clip limits how much the new policy can benefit from down-weighting it further.Zheng, R., Dou, S., Gao, S., Hua, Y., Shen, W., Wang, B.,(2023). Secrets of RLHF in Large Language Models Part I: PPO. ArXiv. /abs/2307.04964 Consequently, the clipping mechanism removes the incentive to move the probability ratio beyond the range [1-\epsilon, 1+\epsilon] in either direction. The advantage of this approach is that the objective can be optimized directly with first-order gradient methods, rather than through the strict KL divergence constraint of TRPO, making the implementation faster and more intuitive.

After computing the clipped surrogate objective, the agent has two terms: one based on the unclipped probability ratio and one based on the clipped ratio. By taking the minimum of the two, the final objective becomes a lower (pessimistic) bound on what the agent knows is possible. In other words, the minimum ensures that the agent makes the safest possible update.
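As a worked illustration with arbitrarily chosen numbers, take \epsilon = 0.2. If \hat{A}_t = 1 (a good action) and the ratio has grown to r_t(\theta) = 1.5, the unclipped term is 1.5 \cdot 1 = 1.5 while the clipped term is 1.2 \cdot 1 = 1.2, so the minimum 1.2 is used and the new policy receives no extra credit for pushing the ratio past 1+\epsilon. If instead \hat{A}_t = -1 (a bad action) and the ratio has already fallen to r_t(\theta) = 0.5, the unclipped term is -0.5 while the clipped term is 0.8 \cdot (-1) = -0.8, so the minimum -0.8 is used and the new policy gains nothing further from pushing the bad action's probability below 1-\epsilon times its old value.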

Advantages

= Simplicity =

PPO approximates what TRPO does, with considerably less computation. It uses first-order optimization (the clip function) to constrain the policy update, whereas TRPO uses a KL divergence constraint enforced with second-order information. Compared to TRPO, PPO is relatively easy to implement and requires fewer computational resources and less time. Therefore, it is cheaper and more efficient to use PPO in large-scale problems.J. Nocedal and Y. Nesterov., "Natural, trust region and proximal policy optimization," TransferLab, https://transferlab.ai/blog/trpo-and-ppo/

= Stability =

While other RL algorithms require extensive hyperparameter tuning, PPO comparatively requires little (an epsilon of 0.2 works well in most cases).J. Hui, "RL - reinforcement learning algorithms comparison," Medium, https://jonathan-hui.medium.com/rl-reinforcement-learning-algorithms-comparison-76df90f180cf/ Also, PPO does not require sophisticated optimization techniques: it can be implemented easily with standard deep learning frameworks and generalizes to a broad range of tasks.

= Sample efficiency =

Sample efficiency indicates whether an algorithm needs more or less data to train a good policy. PPO achieves sample efficiency through its use of the surrogate objective, which keeps the new policy from moving too far from the old policy; the clip function regularizes the policy update and allows training data to be reused. Sample efficiency is especially useful for complicated and high-dimensional tasks, where data collection and computation can be costly.XiaoYang-ElegantRL, "ElegantRL: Mastering PPO Algorithms - towards Data Science," Medium, Nov. 23, 2022. [Online]. Available: https://towardsdatascience.com/elegantrl-mastering-the-ppo-algorithm-part-i-9f36bc47b791

See also

References

{{reflist}}