Training
On-policy: the actor being trained and the actor interacting with the environment are the same.
1. Initialise the actor's parameters.
2. Run a number of episodes and collect (s, a, r) data (state, action, reward).
3. Use this data to define how good particular actions are in particular states (the exact definition is in the PDF).
4. Train the actor on these judgements: for example, if action a is judged good in state s, train the actor to output a with higher probability in that state.
5. After training, obtain the updated actor parameters.
6. Return to step 2 and train again, repeating until training is complete.
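The loop above can be sketched as a minimal on-policy (REINFORCE-style) update on a hypothetical one-state toy problem, where action 1 gives reward 1 and action 0 gives reward 0. The environment, numbers, and variable names here are illustrative assumptions, not from the PDF:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(2)              # step 1: initialise actor parameters
lr = 0.5

for _ in range(50):              # step 6: repeat until done
    probs = softmax(theta)
    # step 2: run episodes and collect (s, a, r) data
    data = []
    for _ in range(20):
        a = rng.choice(2, p=probs)
        data.append((a, 1.0 if a == 1 else 0.0))
    # steps 3-4: score actions by reward (minus a baseline) and push the
    # actor towards actions that scored well (policy gradient)
    baseline = np.mean([r for _, r in data])
    grad = sum((r - baseline) * (np.eye(2)[a] - probs) for a, r in data)
    theta += lr * grad / len(data)   # step 5: updated actor parameters

print(softmax(theta))            # probability mass shifts to action 1
```

After 50 iterations the actor assigns most of its probability to the rewarded action, which is exactly the "make good actions more probable" idea in step 4.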
Off-policy: the actor being trained and the actor interacting with the environment are different.
The actor being trained has to account for how it differs from the actor doing the interacting.
Example: PPO (covered later).
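One standard way PPO accounts for that difference is its clipped surrogate objective, which limits how far the trained policy's probability ratio can drift from the interacting (old) policy. The numbers below are made up for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """ratio = pi_theta(a|s) / pi_theta_old(a|s); clip keeps updates small."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

# If the new policy drifts too far (ratio 1.5), the gain is clipped to 1.2:
print(ppo_clip_objective(1.5, advantage=1.0))
```

The clip means the objective stops rewarding the trained actor for moving further than epsilon away from the actor that collected the data.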
Critic
A critic does not directly determine the action.
Given an actor, it evaluates how good that actor is.
Example: the value function V^θ(s): for an actor with parameters θ, the expected discounted cumulated reward when the state is s.
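In standard notation (a sketch assuming discount factor γ and per-step rewards r_t, not taken from the PDF), this reads:

```latex
V^{\theta}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]
```

where the expectation is over trajectories produced by the actor with parameters θ.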
How to estimate V^θ(s)? (Two methods; details in the PDF.)
Monte-Carlo (MC)
Temporal-Difference (TD)
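The two estimators can be contrasted on a toy two-state chain (a hypothetical example, not from the PDF): s0 → s1 → terminal, reward 1 on each step, discount 0.9. MC averages full observed returns; TD(0) bootstraps from the estimate of the next state:

```python
gamma = 0.9
episodes = [[("s0", 1.0), ("s1", 1.0)]] * 100  # (state, reward) pairs

# Monte-Carlo: average the full discounted return seen from each state
V_mc = {"s0": 0.0, "s1": 0.0}
counts = {"s0": 0, "s1": 0}
for ep in episodes:
    G = 0.0
    for state, r in reversed(ep):
        G = r + gamma * G                       # discounted return from here
        counts[state] += 1
        V_mc[state] += (G - V_mc[state]) / counts[state]

# TD(0): move the estimate towards r + gamma * V(next state)
V_td = {"s0": 0.0, "s1": 0.0, "end": 0.0}
alpha = 0.1
for ep in episodes:
    for i, (state, r) in enumerate(ep):
        nxt = ep[i + 1][0] if i + 1 < len(ep) else "end"
        V_td[state] += alpha * (r + gamma * V_td[nxt] - V_td[state])

print(V_mc["s0"], V_td["s0"])  # both approach 1 + 0.9 * 1 = 1.9
```

On this deterministic chain both converge to the same value; in general MC waits for the whole episode while TD updates after every step.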
Application
Advantage Actor-Critic
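The "advantage" that gives the method its name is commonly formed from a one-step TD error using the critic's value estimates (a standard formulation; the exact definition used in the course is in the PDF):

```python
def advantage(r, v_s, v_next, gamma=0.9):
    """How much better taking this action was than the critic expected:
    one-step TD error r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

print(advantage(r=1.0, v_s=1.5, v_next=1.0))  # ≈ 0.4
```

A positive advantage means the action did better than the critic's baseline, so the actor's probability for it is pushed up; a negative one pushes it down.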
Tips for Actor-Critic
See the PDF.
DQN
See the PDF link.
Reward Shaping
Imitation Learning
No reward function is available.
Only demonstrations from an expert (usually a human) are available.
Inverse RL
Infer the reward function from the expert demonstrations described above.
1. Initialise the actor.
2. Let the actor interact with the environment to produce a trajectory of actions and states.
3. The expert likewise produces a trajectory.
4. Define a reward function that is expected to give the expert's actions high reward and the actor's actions comparatively low reward.
5. Return to step 2: update the actor against the learned reward, update the reward function, and repeat until training ends.
The same idea as a GAN: the actor plays the generator and the learned reward function plays the discriminator.
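The GAN-style loop above can be sketched on a hypothetical one-state, two-action toy problem where the expert always picks action 1. Everything here (environment, learning rates, tabular parameterisation) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

actor_logits = np.zeros(2)   # step 1: initialise the actor (the "generator")
reward = np.zeros(2)         # learnable per-action reward (the "discriminator")
lr = 0.3

for _ in range(200):
    # steps 2-3: actor and expert each produce an action
    actor_a = rng.choice(2, p=softmax(actor_logits))
    expert_a = 1
    # step 4: push reward up for the expert's action, down for the actor's
    reward[expert_a] += lr
    reward[actor_a] -= lr
    # step 5: update the actor to prefer high-reward actions, then repeat
    actor_logits += lr * reward

print(softmax(actor_logits)[1])  # the actor ends up imitating the expert
```

Once the actor matches the expert, the two reward updates cancel, mirroring the GAN equilibrium where the discriminator can no longer separate the two.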
