Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much, by enforcing a KL divergence constraint on the size of the policy update at each iteration.
Take the case of off-policy reinforcement learning, where the policy β used to collect trajectories on rollout workers differs from the policy π that we want to optimize. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated for with an importance sampling estimator:
$$
J\left(\theta\right) = \sum_{s\in\mathcal{S}}p^{\pi_{\theta_{old}}}\left(s\right)\sum_{a\in\mathcal{A}}\left(\beta\left(a\mid{s}\right)\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)
$$
$$
J\left(\theta\right) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a\sim\beta} \left[\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right]
$$
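As a rough illustration, a Monte Carlo estimate of this objective from a sampled batch might look like the sketch below. The function and argument names (`off_policy_objective`, `logp_pi`, `logp_beta`) are hypothetical, not from the source; PyTorch is assumed only so that the estimate stays differentiable with respect to θ.

```python
# Minimal sketch (assumed names, not from the source) of the importance-sampled
# off-policy objective: states come from the old policy's visitation
# distribution, actions from the behavior policy beta, and the mismatch is
# corrected by the weight pi_theta(a|s) / beta(a|s).
import torch

def off_policy_objective(logp_pi, logp_beta, advantages):
    """Estimate J(theta) from a batch of sampled (s, a) pairs.

    logp_pi:    log pi_theta(a|s) for each sampled pair (requires grad)
    logp_beta:  log beta(a|s) under the behavior policy (treated as constant)
    advantages: advantage estimates A_hat_{theta_old}(s, a)
    """
    importance_weights = torch.exp(logp_pi - logp_beta.detach())
    return (importance_weights * advantages).mean()
```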
When training on-policy, the policy used to collect data is, in theory, the same as the policy we want to optimize. However, when rollout workers and optimizers run asynchronously in parallel, the behavior policy can become stale. TRPO accounts for this subtle difference: it labels the behavior policy as $\pi_{\theta_{old}}(a\mid{s})$, and the objective function becomes:
$$
J\left(\theta\right) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a\sim\pi_{\theta_{old}}} \left[\frac{\pi_{\theta}\left(a\mid{s}\right)}{\pi_{\theta_{old}}\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right]
$$
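In this special case the importance weight is simply the probability ratio between the new and the old policy. A minimal sketch (again with assumed names such as `surrogate_objective` and a `policy` that returns a `torch.distributions` object; not the authors' code) of the resulting surrogate objective:

```python
# Sketch of the TRPO surrogate objective when the behavior policy is the
# previous iterate pi_{theta_old}: log-probabilities of the taken actions are
# stored at rollout time and reused here.
import torch

def surrogate_objective(policy, states, actions, old_logp, advantages):
    dist = policy(states)               # assumed to return e.g. a Categorical
    logp = dist.log_prob(actions)       # log pi_theta(a|s)
    ratio = torch.exp(logp - old_logp)  # pi_theta(a|s) / pi_theta_old(a|s)
    return (ratio * advantages).mean()  # J(theta), to be maximized
```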
TRPO aims to maximize the objective function J(θ) subject to a trust region constraint, which enforces that the distance between the old and new policies, measured by KL divergence, remains small enough, i.e., within a parameter δ:
$$
\mathbb{E}_{s\sim p^{\pi_{\theta_{old}}}} \left[D_{KL}\left(\pi_{\theta_{old}}\left(\cdot\mid{s}\right)\,\|\,\pi_{\theta}\left(\cdot\mid{s}\right)\right)\right] \leq \delta
$$
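In practice, TRPO proposes an update direction (computed with the conjugate gradient method on Fisher-vector products, omitted here) and then uses a backtracking line search so that the sampled KL stays below δ and the surrogate objective improves. The sketch below is a simplification of that acceptance test, not the authors' implementation; the helpers `get_params`, `set_params`, `surrogate_fn`, and the precomputed `full_step` are hypothetical.

```python
# Simplified sketch of the trust-region check: shrink the proposed step until
# the sampled mean KL is within delta and the surrogate objective improves.
import torch
from torch.distributions import kl_divergence

def mean_kl(old_dist, policy, states):
    """Sample-based estimate of E_s[ KL(pi_old(.|s) || pi_theta(.|s)) ]."""
    new_dist = policy(states)
    return kl_divergence(old_dist, new_dist).mean()

def line_search(policy, get_params, set_params, full_step, states,
                old_dist, surrogate_fn, delta, max_backtracks=10):
    params_old = get_params(policy)          # flattened parameter vector
    obj_old = surrogate_fn()                 # surrogate value before the step
    for i in range(max_backtracks):
        step_frac = 0.5 ** i                 # geometrically shrink the step
        set_params(policy, params_old + step_frac * full_step)
        kl = mean_kl(old_dist, policy, states)
        if kl <= delta and surrogate_fn() > obj_old:
            return True                      # accept the shrunk step
    set_params(policy, params_old)           # reject: restore old parameters
    return False
```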