Maximum Likelihood Estimation

Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning. It avoids parameter updates that change the policy too much by enforcing a KL divergence constraint on the size of the policy update at each iteration.

Take the case of off-policy reinforcement learning, where the policy β for collecting trajectories on rollout workers is different from the policy π to optimize for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator:
$$
J\left(\theta\right) = \sum_{s\in\mathcal{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\pi_{\theta}\left(a\mid{s}\right)\hat{A}_{\theta_{old}}\left(s, a\right)\right)
$$

$$
J\left(\theta\right) = \sum_{s\in\mathcal{S}}p^{\pi_{\theta_{old}}}\sum_{a\in\mathcal{A}}\left(\beta\left(a\mid{s}\right)\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)
$$

$$
J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\beta}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\beta\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)
$$
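
As a small numerical check of the importance sampling step, the sketch below fixes a single state and compares the direct sum over actions with the β-weighted form. The three-action distributions and advantage values are made-up numbers for illustration, and `sample_estimate` is a hypothetical helper, not part of any library.

```python
import random

# Made-up numbers for a single state with three actions.
pi = [0.7, 0.2, 0.1]      # target policy pi_theta(a|s)
beta = [0.5, 0.3, 0.2]    # behavior policy beta(a|s)
adv = [1.0, -0.5, 2.0]    # advantage estimates A_hat(s, a)

# Direct form: sum_a pi(a|s) * A_hat(s, a)
direct = sum(p * a for p, a in zip(pi, adv))

# Importance-weighted form: sum_a beta(a|s) * (pi/beta) * A_hat(s, a)
weighted = sum(b * (p / b) * a for p, b, a in zip(pi, beta, adv))


def sample_estimate(n, seed=0):
    """Monte Carlo version: draw actions from beta, average pi/beta * A_hat."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.choices(range(len(beta)), weights=beta)[0]
        total += pi[a] / beta[a] * adv[a]
    return total / n


print(direct, weighted, sample_estimate(20000))
```

The first two values agree up to floating-point rounding, and the sampled estimate approaches them as the number of draws grows, even though no action was ever drawn from π itself.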

When training on policy, the policy used to collect data is in theory the same as the policy being optimized. However, when rollout workers and optimizers run in parallel and asynchronously, the behavior policy can become stale. TRPO accounts for this subtle difference: it labels the behavior policy as π_θold(a∣s), and the objective function becomes:
$$
J\left(\theta\right) = \mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}, a\sim{\pi_{\theta_{old}}}} \left(\frac{\pi_{\theta}\left(a\mid{s}\right)}{\pi_{\theta_{old}}\left(a\mid{s}\right)}\hat{A}_{\theta_{old}}\left(s, a\right)\right)
$$
TRPO aims to maximize the objective function J(θ) subject to a trust region constraint which enforces the distance between old and new policies measured by KL-divergence to be small enough, within a parameter δ:
$$
\mathbb{E}_{s\sim{p}^{\pi_{\theta_{old}}}} \left[D_{KL}\left(\pi_{\theta_{old}}\left(\cdot\mid{s}\right)\,\|\,\pi_{\theta}\left(\cdot\mid{s}\right)\right)\right] \leq \delta
$$
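
To make the constrained objective concrete, here is a minimal sketch for one state with a discrete action set. It only evaluates the surrogate objective and the KL term for two hand-picked candidate policies; it is not TRPO's actual optimizer. The probabilities, advantages, and the value of `delta` are assumptions chosen for illustration.

```python
import math


def surrogate_objective(pi_new, pi_old, adv):
    """E_{a ~ pi_old}[ pi_new/pi_old * A_hat ] for one state, computed exactly."""
    return sum(po * (pn / po) * a for pn, po, a in zip(pi_new, pi_old, adv))


def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


pi_old = [0.5, 0.3, 0.2]          # hypothetical old policy
adv = [1.0, -0.5, 2.0]            # hypothetical advantage estimates
delta = 0.01                      # hypothetical trust-region size

small_step = [0.52, 0.27, 0.21]   # stays close to pi_old
large_step = [0.10, 0.05, 0.85]   # chases the high-advantage action

for name, candidate in [("small", small_step), ("large", large_step)]:
    kl = kl_divergence(pi_old, candidate)
    print(name, surrogate_objective(candidate, pi_old, adv), kl, kl <= delta)
```

The large step has the higher surrogate value but violates the KL bound, so under the constraint only the small step would be admissible.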

https://zhuanlan.zhihu.com/p/384334291

https://blog.csdn.net/qq_43616565/article/details/121090957

https://zhuanlan.zhihu.com/p/331850355?utm_source=wechat_session

https://zhuanlan.zhihu.com/p/114866455

https://zhuanlan.zhihu.com/p/26308073

RL items

Posted on 2022-05-04 Edited on 2022-05-06

https://blog.csdn.net/qq_33328642/article/details/123683755

non-stationary, sample efficiency, planning and learning, reward, off-policy and on-policy, infinite horizon, finite horizon, regret

Non-stationary: https://stepneverstop.github.io/rl-classification.html

Stationary or not

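Depending on whether the environment is stable, reinforcement learning problems can be divided into stationary and non-stationary.

If the state-transition and reward functions are fixed, that is, the outcome of executing a chosen action a follows the same rule every time, the environment is stationary.

If the state-transition or reward function is not fixed, that is, the outcome of executing a chosen action a can follow a different rule at different times, the environment is non-stationary.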

A stationary policy π_t is a policy that does not change over time, that is, π_t = π for all t ≥ 0, where π can be either a function π: S → A (a deterministic policy) or a conditional density π(A ∣ S) (a stochastic policy). A non-stationary policy is a policy that is not stationary. More precisely, π_i may differ from π_j for i ≠ j, with i, j ≥ 0, where i and j are two different time steps.
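
The distinction can be sketched with two toy deterministic policies. Both function names, the one-dimensional state, and the switch at step 10 are made up for illustration.

```python
def stationary_policy(state, t):
    """The same state maps to the same action at every time step (t is ignored)."""
    return "left" if state < 0 else "right"


def non_stationary_policy(state, t):
    """The state-to-action mapping itself depends on t: it flips from step 10 on."""
    action = "left" if state < 0 else "right"
    if t >= 10:
        action = "right" if action == "left" else "left"
    return action


for t in (0, 50):
    print(t, stationary_policy(1.0, t), non_stationary_policy(1.0, t))
```

For the same state, the stationary policy returns the same action at t = 0 and t = 50, while the non-stationary one does not.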

Sample efficiency in reinforcement learning: https://blog.csdn.net/wxc971231/article/details/120992949

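The word "horizon" does not appear very often in reinforcement learning tutorials, but it is still a concept worth understanding.
First, the dictionary definition:
n. horizon (the line where earth and sky meet); field of view; outlook; scope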

[Reinforcement Learning] The Deep Q Network (DQN) Algorithm Explained

Posted on 2022-02-20 Edited on 2022-03-04

https://blog.csdn.net/qq_30615903/article/details/80744083

https://zhuanlan.zhihu.com/p/107874859

https://liubingqing.blog.csdn.net/article/details/121595512?spm=1001.2101.3001.6661.1&utm_medium=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1.pc_relevant_paycolumn_v3&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1.pc_relevant_paycolumn_v3&utm_relevant_index=1

Threads

Posted on 2021-11-29

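After reading through the top-ranked answers, I find that replies such as "a process is the smallest unit of resource allocation, and a thread is the smallest unit of CPU scheduling" are too abstract and not easy to understand.

A simple analogy: process = train, thread = carriage

A thread runs inside a process (a carriage on its own cannot run).
A process can contain multiple threads (a train can have multiple carriages).
Data is hard to share between different processes (passengers on one train can hardly move to another train, except for example by changing at a station).
Data is easy to share between threads of the same process (moving from carriage A to carriage B is easy).
Processes consume more computer resources than threads (running several trains costs more than adding several carriages).
Processes do not affect one another, whereas one thread crashing brings down its whole process (one train does not affect another train, but if a carriage in the middle of a train catches fire, every carriage of that train is affected).
Processes can scale out to multiple machines, whereas threads can at most make use of multiple cores (different trains can run on different tracks, but the carriages of one train cannot travel on different tracks).
Memory addresses used by a process can be locked: while one thread is using a piece of shared memory, the other threads must wait until it finishes before they can use that memory (like the washroom on a train). This is a "mutex".
Memory addresses used by a process can have a usage limit (like the dining car on a train, which admits only so many people; if it is full, you wait at the door until someone comes out). This is a "semaphore".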
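
The last two points, the mutex and the semaphore, can be sketched with Python's standard `threading.Lock` and `threading.Semaphore`. The helper names `count_with_mutex` and `peak_occupancy`, the thread counts, and the sleep duration are assumptions made for this illustration.

```python
import threading
import time


def count_with_mutex(n_threads=5, n_increments=10000):
    """Increment a shared counter from several threads under a mutex."""
    counter = 0
    lock = threading.Lock()  # the "washroom": one occupant at a time

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            with lock:  # acquired here, released when the block exits
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


def peak_occupancy(n_threads=6, seats=2):
    """Return the largest number of threads that held the semaphore at once."""
    dining_car = threading.Semaphore(seats)  # at most `seats` holders
    stats_lock = threading.Lock()            # protects the two counters below
    inside = 0
    peak = 0

    def worker():
        nonlocal inside, peak
        with dining_car:  # blocks while the dining car is full
            with stats_lock:
                inside += 1
                peak = max(peak, inside)
            time.sleep(0.01)  # stay inside for a moment
            with stats_lock:
                inside -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


print(count_with_mutex())  # 5 threads x 10000 increments
print(peak_occupancy())    # never exceeds the number of seats
```

With the lock in place, every increment is counted, and the semaphore keeps the number of threads in the guarded section at or below `seats`.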

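import threading
import time
import random


def takeSleep(id, name):
    print(name+'-'+id+': thread started...')
    time.sleep(random.randint(0, 3))
    print(name+'-'+id+': thread task finished')


print('Main program started...')
threads = []
for i in range(0, 5):
    t = threading.Thread(target=takeSleep, args=(str(i), 'zhangphil'))
    threads.append(t)
    t.start()

print('Main program running...')

# Wait for all thread tasks to finish.
for t in threads:
    t.join()

print("All thread tasks completed")


Sample output (two messages can land on one line because the threads print concurrently):

Main program started...
zhangphil-0: thread started...
zhangphil-1: thread started...
zhangphil-0: thread task finishedzhangphil-2: thread started...
zhangphil-3: thread started...

zhangphil-1: thread task finished
zhangphil-4: thread started...Main program running...

zhangphil-4: thread task finished
zhangphil-2: thread task finished
zhangphil-3: thread task finished
All thread tasks completed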

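from multiprocessing import Process


def fun1(name):
    print('Testing %s multiprocessing' % name)


if __name__ == '__main__':
    process_list = []
    for i in range(5):  # start 5 child processes that run fun1
        p = Process(target=fun1, args=('Python',))  # create the process object
        p.start()
        process_list.append(p)

    for p in process_list:  # wait for every child process, not only the last one
        p.join()

    print('Test finished')

Testing Python multiprocessing
Testing Python multiprocessing
Testing Python multiprocessing
Testing Python multiprocessing
Testing Python multiprocessing
Test finished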

import threading
import time


def child_thread1():
    for i in range(100):
        time.sleep(1)
        print('child_thread1_running...')


def parent_thread():
    print('parent_thread_running...')
    thread1 = threading.Thread(target=child_thread1)
    thread1.start()
    print('parent_thread_exit...')


if __name__ == "__main__":
    parent_thread()

parent_thread_running...
parent_thread_exit...
child_thread1_running...
child_thread1_running...
child_thread1_running...
child_thread1_running...
...

Getting the return value
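
The post gives only this heading, so what follows is an assumption about the intended topic. A `threading.Thread` discards whatever its target function returns; one common way to get results back is the standard library's `concurrent.futures.ThreadPoolExecutor`. The functions `square` and `squares_in_threads` are hypothetical names for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """Stand-in for any task whose result we want back."""
    return x * x


def squares_in_threads(values, max_workers=4):
    """Run square() on worker threads and collect the return values in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(square, v) for v in values]
        # Future.result() blocks until that task is done and returns its value
        # (or re-raises the exception the task raised).
        return [f.result() for f in futures]


print(squares_in_threads([1, 2, 3]))
```

Because the results are read from the futures in submission order, the output order matches the input order even if the tasks finish in a different order.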

Linear Regression

Posted on 2021-11-20 Edited on 2021-11-21

https://blog.csdn.net/weixin_48077303/article/details/108861439

Various Kinds of Regression with sklearn

https://blog.csdn.net/Yeoman92/article/details/75051848

Decision Trees: Regression

https://zhuanlan.zhihu.com/p/42505644

Polynomial Regression

https://zhuanlan.zhihu.com/p/77555547

Hyperparameter Tuning in Machine Learning: Bayesian Optimization

Posted on 2021-11-20 In ML

The Problem of Finding the Optimum

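The simplest way to find the optimum is grid search.

If grid search is a bit too expensive, random search is worth trying.

If the function is convex, we can use gradient descent. Many machine learning algorithms rely on it, such as linear regression and logistic regression.

If the black-box function is very expensive to evaluate and is not convex, we turn to Bayesian optimization.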
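
As a rough illustration of the first two options, the sketch below runs grid search and random search on a toy two-parameter function. The function `objective`, its peak at (0.3, 0.7), and the trial counts are made up; in practice each evaluation would be a full model training run.

```python
import random


def objective(x, y):
    """Hypothetical score to maximize; its peak is at (0.3, 0.7)."""
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)


def grid_search(steps=11):
    """Evaluate every point of a steps x steps grid on [0, 1] x [0, 1]."""
    points = [i / (steps - 1) for i in range(steps)]
    return max(((x, y) for x in points for y in points),
               key=lambda p: objective(*p))


def random_search(n_trials=50, seed=0):
    """Evaluate n_trials points drawn uniformly from [0, 1] x [0, 1]."""
    rng = random.Random(seed)
    trials = [(rng.random(), rng.random()) for _ in range(n_trials)]
    return max(trials, key=lambda p: objective(*p))


print(grid_search())     # 121 evaluations
print(random_search())   # 50 evaluations
```

Grid search costs `steps ** d` evaluations for `d` parameters, which is why random search with a fixed budget becomes attractive as the number of parameters grows.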

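Bayesian optimization concepts
In Bayesian optimization, we call this black-box function the objective function. Because the objective function is expensive to evaluate, we fit an approximation to it, called the surrogate function. The surrogate function yields a mean curve and the corresponding standard deviation. With the surrogate function in hand, we can choose the next point to explore. This choice is made by an acquisition function.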

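Bayesian optimization is carried out within a specific search space.

The whole process is as follows:

1. Choose a few initial points X in the search space.
2. Evaluate the objective function at X to obtain the corresponding values y.
3. Update the surrogate function.
4. Use the acquisition function to obtain the next sample point.
5. Go to step 2.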
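
The loop above can be sketched in pure Python for a one-dimensional search space. This is a minimal illustration under several assumptions that the post does not state: a zero-mean Gaussian process with an RBF kernel as the surrogate, an upper confidence bound as the acquisition function, a fixed candidate grid as the search space, and a cheap toy objective standing in for the expensive one. All function names are hypothetical.

```python
import math


def rbf(a, b, length=0.2):
    """RBF kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(X, y, x_new, jitter=1e-6):
    """Surrogate: GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [rbf(a, x_new) for a in X]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, y)))
    var = rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 0.0))


def acquisition(mean, std, kappa=2.0):
    """Upper confidence bound: favors high mean or high uncertainty."""
    return mean + kappa * std


def bayes_opt(objective, candidates, X_init, n_iter=10):
    X = list(X_init)                      # initial points
    y = [objective(x) for x in X]         # their objective values
    for _ in range(n_iter):
        # The surrogate is refit from all (X, y) pairs inside gp_posterior.
        remaining = [c for c in candidates if c not in X]
        x_next = max(remaining,
                     key=lambda c: acquisition(*gp_posterior(X, y, c)))
        X.append(x_next)                  # next sample point
        y.append(objective(x_next))       # evaluate it, then loop again
    best = max(range(len(X)), key=lambda i: y[i])
    return X[best], y[best]


def toy_objective(x):
    """Cheap stand-in for an expensive black box; maximum at x = 0.6."""
    return -(x - 0.6) ** 2


grid = [i / 50 for i in range(51)]
print(bayes_opt(toy_objective, grid, X_init=[0.0, 1.0]))
```

The `kappa` parameter controls the trade-off: larger values push the search toward unexplored regions, smaller values toward points the surrogate already believes are good.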

https://github.com/juwikuang/machine_learning_step_by_step/blob/master/bayesian_optimization.ipynb

pandas

Posted on 2021-11-18 In python

2. Parameters of the sort_values() function

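Usage:
DataFrame.sort_values(by='##', axis=0, ascending=True, inplace=False, na_position='last')

Parameter	Description
by	Column name(s) to sort by (when axis=0 or 'index'), or index label(s) to sort by (when axis=1 or 'columns')
axis	If axis=0 or 'index', sort by the values in the specified column(s); if axis=1 or 'columns', sort by the values in the specified index label(s). Default: axis=0
ascending	Whether to sort in ascending order; default True, i.e. ascending
inplace	Whether to replace the original data with the sorted data; default False, i.e. no replacement
na_position	{'first', 'last'}; where missing values are placed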
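
A short sketch of these parameters in use follows; the column names and values are made up for illustration.

```python
import pandas as pd

# Made-up data with one missing score.
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "score": [3.0, None, 1.0, 2.0]})

# Descending by score; the row with the missing value goes last.
by_score = df.sort_values(by="score", axis=0, ascending=False,
                          na_position="last")

# Ascending by score; the row with the missing value goes first.
na_first = df.sort_values(by="score", ascending=True, na_position="first")

print(by_score)
print(na_first)
```

Because `inplace` is left at its default of False, `df` itself keeps its original row order and each call returns a new DataFrame.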

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Drop specified labels from rows or columns.

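With the default axis=0, the operation applies to rows; to operate on columns, set axis=1.

With the default inplace=False, the drop does not modify the original data but returns a new DataFrame with the drop applied; to drop directly on the original data, set inplace=True.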

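Parameter	Description
labels	Names of the rows or columns to drop, given as a list
axis	Default 0, meaning rows are dropped; to drop columns, specify axis=1
index	Directly specifies the rows to drop
columns	Directly specifies the columns to drop
inplace=False	Default; the drop does not modify the original data but returns a new DataFrame with the drop applied
inplace=True	The drop is applied directly to the original data and cannot be undone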

Use the reset_index() method to reset the index.
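
The following sketch combines `drop` and `reset_index` on a small made-up DataFrame; the column names and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"],
                   "score": [3, 1, 2],
                   "tmp": [0, 0, 0]})

# Drop a column; equivalent to df.drop(labels=["tmp"], axis=1).
no_tmp = df.drop(columns=["tmp"])

# Drop the row labelled 1; equivalent to no_tmp.drop(labels=[1], axis=0).
no_row1 = no_tmp.drop(index=[1])

# The remaining index is [0, 2]; reset_index(drop=True) renumbers it from 0
# without keeping the old index as a new column.
renumbered = no_row1.reset_index(drop=True)

print(renumbered)
```

Since `inplace` stays at False throughout, `df` still contains the `tmp` column and all three rows after these calls.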

YANG JIYI
