
RL items

https://blog.csdn.net/qq_33328642/article/details/123683755

non-stationary, sample efficiency, planning and learning, reward, off-policy and on-policy, infinite horizon / finite horizon, regrets

non-stationary: https://stepneverstop.github.io/rl-classification.html

Stationary or not

Depending on whether the environment is stationary, reinforcement learning problems can be divided into stationary and non-stationary ones.

If the state-transition and reward functions are fixed, i.e. the outcome distribution of executing a chosen action a does not change over time, the environment is stationary.

If the state-transition and reward functions themselves change over time, i.e. the outcome distribution of executing a chosen action a can differ at different times, the environment is non-stationary.

A stationary policy, π_t, is a policy that does not change over time, that is, π_t = π for all t ≥ 0, where π can either be a function, π: S → A (a deterministic policy), or a conditional density, π(A | S) (a stochastic policy). A non-stationary policy is one that is not stationary. More precisely, π_i may not equal π_j for i ≠ j ≥ 0, where i and j are two different time steps.
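To make the distinction concrete, here is a minimal sketch (my own illustration, not from the linked post) of a stationary versus a non-stationary k-armed bandit: in the stationary case each arm's reward distribution is fixed, while in the non-stationary case the arm means drift over time.

```python
import numpy as np

class StationaryBandit:
    """k-armed bandit whose reward distributions never change."""
    def __init__(self, k=5, seed=0):
        rng = np.random.default_rng(seed)
        self.means = rng.normal(0.0, 1.0, size=k)  # fixed forever

    def pull(self, arm):
        return np.random.normal(self.means[arm], 1.0)

class NonStationaryBandit(StationaryBandit):
    """Same bandit, but the arm means take a small random walk after
    every pull, so the 'rules of the game' change over time."""
    def pull(self, arm):
        reward = np.random.normal(self.means[arm], 1.0)
        self.means += np.random.normal(0.0, 0.01, size=self.means.shape)
        return reward
```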

Sample efficiency in reinforcement learning: https://blog.csdn.net/wxc971231/article/details/120992949

The word "horizon" does not appear very often in RL tutorials, but it is still a concept worth knowing.
The dictionary gives: horizon — the line where the sky meets the earth; one's field of view; scope; range.

In reinforcement learning, "horizon" mostly carries the "range" sense. It can be understood as the total number of steps an agent takes while interacting with the environment during one episode.
For example, suppose there is a game you can never die in no matter how you play (only the score differs). Mapped to reinforcement learning, playing that game is an infinite-horizon problem; otherwise the problem is finite-horizon.
When training an RL model, an episode does not have to run until the game is over; we can also compute the reward within a fixed horizon. From this point of view, the horizon can also be seen as the agent's remaining lifetime: when the number of remaining steps changes, the agent's behaviour may change accordingly.
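As a sketch of truncating an episode at a fixed horizon (the environment API here is a hypothetical placeholder with reset()/step(), not a specific library), one can simply stop the rollout after a fixed number of steps instead of waiting for a terminal state:

```python
def rollout(env, policy, horizon=100):
    """Run one episode for at most `horizon` steps and return the rewards.
    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy` maps a state to an action."""
    rewards = []
    state = env.reset()
    for _ in range(horizon):          # finite horizon: stop after `horizon` steps
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:                      # the environment may still end earlier
            break
    return rewards
```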

https://www.guyuehome.com/36204

https://www.92python.com/view/409.html

There are two ways to compute the cumulative reward. The first is to sum all rewards from the current state to the terminal state:

G_t = r_{t+1} + r_{t+2} + … + r_{t+T}

This works in the finite-horizon case. In the infinite-horizon case, however, the agent may be carrying out a task that runs for a very long time, such as autonomous driving, and computing the cumulative reward with the formula above is clearly unreasonable.

To keep the value finite, a discount factor is usually introduced:

G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}

In the formula above, 0 ≤ γ ≤ 1. When γ = 0, the agent only considers the reward of the next step; the closer γ is to 1, the more future rewards are taken into account. Note that sometimes we care more about immediate rewards and sometimes more about future ones; the way to adjust this is to change the value of γ.
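A minimal sketch of both ways of accumulating reward (the function names are mine, not from the linked articles):

```python
def finite_horizon_return(rewards):
    """G_t = r_{t+1} + r_{t+2} + ... : plain sum, only safe for finite horizons."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
    With 0 <= gamma < 1 the sum stays finite even for very long episodes."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# gamma = 0 only looks at the next reward; gamma close to 1 weighs the future heavily.
print(discounted_return([1, 1, 1], gamma=0.0))   # 1.0
print(discounted_return([1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```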

Regrets

The action you regret the most is the one that should (more likely) have been taken. So the probability of taking this action is proportional to how deeply you regret not having taken it.

Mathematically speaking, the regret is expressed as the difference between the payoff (reward or return) of a possible action and the payoff of the action that was actually taken. If we denote the payoff function as *u*, the formula becomes:

*regret = u(possible action) - u(action taken)*

Clearly we are interested in cases where the payoff of the *possible action* outperforms the payoff of the *action taken*, so we consider positive regrets and ignore zero and negative regrets.

As said earlier, the probability of using an action other than the one actually used is proportional to the regret it generates.

For example, if we took action a1 and got u(a1) = 1, then computed u(a2) = 2, u(a3) = 4, u(a4) = 7, the respective regrets would be regret(a2) = u(a2) - u(a1) = 1, and likewise regret(a3) = 3 and regret(a4) = 6.
The total regret is regret(a1) + regret(a2) + regret(a3) + regret(a4) = 0 + 1 + 3 + 6 = 10.

It is easy to see that the most regretted action is a4. To reflect this numerically, we update our strategy, denoted σ, such that σ(a1) = 0/10 = 0, σ(a2) = 1/10 = 0.1, σ(a3) = 3/10 = 0.3, and σ(a4) = 6/10 = 0.6.

Obviously, you might ask: why not explicitly give action a4 a probability of 1 (σ(a4) = 1)? Simply because the notion of regret is used when facing another actor, such as in games. Playing deterministically gives your opponent a chance to counter your strategy and win.
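A minimal sketch of the regret-matching update described above, reusing the numbers from the example (the function name is hypothetical):

```python
def regret_matching(payoffs, taken):
    """Turn regrets against the taken action into a mixed strategy.
    `payoffs` maps each action to its payoff u(a); `taken` is the action played."""
    regrets = {a: max(u - payoffs[taken], 0.0) for a, u in payoffs.items()}
    total = sum(regrets.values())
    if total == 0:                      # no positive regret: fall back to uniform play
        return {a: 1.0 / len(payoffs) for a in payoffs}
    return {a: r / total for a, r in regrets.items()}

payoffs = {"a1": 1, "a2": 2, "a3": 4, "a4": 7}
print(regret_matching(payoffs, taken="a1"))
# {'a1': 0.0, 'a2': 0.1, 'a3': 0.3, 'a4': 0.6}
```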

Improving Sample Efficiency In Model-Free Reinforcement Learning From Images: https://blog.csdn.net/sufail/article/details/104889591

Perceiving, remembering, and thinking about things is called cognition.

Perceiving, remembering, and thinking about cognition itself is called metacognition.

Cognition about cognition is metacognition; likewise, learning about learning can be understood as meta-learning.

Although meta-learning can be carried out with online learning, in essence:

Online learning = letting the machine learn from a continual stream of fresh data.

Meta-learning = letting the machine learn how to learn better.