env, actor, reward

Training

On-policy: the actor being trained and the actor interacting with the environment are the same.

1. Initialise the actor's parameters θ.
2. Run a number of episodes with the actor and collect (s, a, r) data (state, action, reward).
3. From this data, define how good each action is in a given state (see the PDF for the exact definition).
4. Train the actor against these scores: for example, if action a is defined as good in state s, train the actor to output a with higher probability in that state.
5. After training, obtain the updated actor parameters θ′.
6. Return to step 2 and train again to obtain the next model θ″, repeating until training is complete.
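The on-policy loop above can be sketched on a toy two-action problem. Everything here (the one-state environment, the batch size, the learning rate, and scoring actions by reward minus a batch-average baseline) is an illustrative assumption, not the exact method in the PDF:

```python
import math
import random

random.seed(0)

# Hypothetical toy environment: a single state with two actions;
# action 1 always yields reward 1, action 0 yields reward 0.
def step(action):
    return 1.0 if action == 1 else 0.0

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.0, 0.0]   # 1. initialise actor parameters
lr = 0.1

for _ in range(200):                       # steps 2-6, repeated
    # 2. run episodes with the CURRENT actor, collecting (a, r)
    batch = []
    for _ in range(10):
        probs = softmax(theta)
        a = 0 if random.random() < probs[0] else 1
        batch.append((a, step(a)))
    # 3. score each action: its reward minus the batch-average baseline
    baseline = sum(r for _, r in batch) / len(batch)
    # 4. raise the probability of actions scored as good
    probs = softmax(theta)
    for a, r in batch:
        score = r - baseline
        for i in range(len(theta)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * score * grad
    # 5-6. theta now holds the updated parameters; loop back to step 2
```

After training, most of the probability mass ends up on action 1, the action that yields reward.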

Off-policy: the actor being trained and the actor interacting with the environment are different.

The actor being trained has to account for how it differs from the actor used to interact.
Example: PPO (studied later).
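One concrete way the trained actor accounts for its difference from the data-collecting actor is PPO's clipped probability ratio. A minimal sketch (ε and the sample numbers are assumptions):

```python
eps = 0.2   # PPO clipping range (a common default, assumed here)

def ppo_objective(p_new, p_old, advantage):
    # ratio measures how far the trained actor's policy has drifted
    # from the (older) actor that collected the data
    ratio = p_new / p_old
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    # taking the min removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

# ratio 0.5/0.4 = 1.25 is clipped to 1.2, capping the update for an
# already-favoured action with positive advantage
capped = ppo_objective(0.5, 0.4, 1.0)
```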

Critic

Critic does not directly determine the action.
Given an actor with parameters θ, it evaluates how good that actor is.
Example: the value function V^θ(s): for the actor parameterised by θ, the expected discounted cumulated reward obtained when starting from state s.

How to estimate V^θ(s) (two methods; see the PDF for details):

Monte-Carlo(MC)
Temporal-difference(TD)
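Both estimators can be sketched on a hypothetical two-step chain (state 0 → state 1 → terminal, reward 1 on the final step; with γ = 0.9 the true values are V(1) = 1 and V(0) = 0.9):

```python
gamma = 0.9

# Deterministic toy chain: state 0 -> state 1 -> terminal (state 2);
# a reward of 1 is received on the step leaving state 1.
def run_episode():
    return [(0, 0.0), (1, 1.0)]   # list of (state, reward) steps

# Monte-Carlo: after a FULL episode, update V(s) toward the complete
# discounted return G observed from s onward.
V_mc = {0: 0.0, 1: 0.0}
counts = {0: 0, 1: 0}
for _ in range(100):
    G = 0.0
    for s, r in reversed(run_episode()):
        G = r + gamma * G
        counts[s] += 1
        V_mc[s] += (G - V_mc[s]) / counts[s]   # running average

# Temporal-difference: update V(s) after EVERY step, toward the
# bootstrapped target r + gamma * V(s').
V_td = {0: 0.0, 1: 0.0, 2: 0.0}   # state 2 is terminal, value 0
alpha = 0.1
for _ in range(500):
    ep = run_episode()
    for i, (s, r) in enumerate(ep):
        s_next = ep[i + 1][0] if i + 1 < len(ep) else 2
        V_td[s] += alpha * (r + gamma * V_td[s_next] - V_td[s])
```

MC waits for the whole episode and uses the actual return; TD updates immediately from a bootstrapped one-step target. Both converge to the same true values here.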

Application

Advantage Actor-Critic
The action taken at step t is scored by the advantage A_t = r_t + γV^θ(s_{t+1}) − V^θ(s_t).
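On a hypothetical three-step trajectory, the advantage of each action can be computed from the critic's value estimates like so (all the numbers are made up for illustration):

```python
gamma = 0.99

# Hypothetical 3-step trajectory: per-step rewards and the critic's
# value estimates V(s_0)..V(s_3); s_3 is terminal, so V(s_3) = 0.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.0]

# Advantage of the action at step t: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).
# A_t > 0 means the action turned out better than the critic expected,
# so the actor raises its probability; A_t < 0 lowers it.
advantages = [rewards[t] + gamma * values[t + 1] - values[t]
              for t in range(len(rewards))]
```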

Tips of Actor-Critic
See the PDF.

DQN

See the linked PDF.

Reward Shaping

Used to address the sparse-reward problem.
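One standard form of reward shaping is potential-based shaping, which adds F(s, s′) = γφ(s′) − φ(s) to the sparse environment reward; shaping of this form is known to preserve the optimal policies. A minimal sketch with a hypothetical potential function:

```python
gamma = 0.99

def phi(state):
    # Hypothetical potential: negative distance to a goal at position 10.
    return -abs(10 - state)

def shaped_reward(env_reward, state, next_state):
    # Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)
    return env_reward + gamma * phi(next_state) - phi(state)

# A step that moves toward the goal earns a positive bonus even though
# the sparse environment reward is still 0.
bonus = shaped_reward(0.0, 3, 4)
```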

Imitation Learning

no reward signal from the environment
instead, demonstrations from an expert (usually a human) are available

Inverse RL

Infer the reward function from the expert demonstrations described above.
1. Initialise an actor.
2. Let the actor interact with the environment, producing trajectories τ.
3. The expert likewise produces trajectories τ̂.
4. Define a reward function that gives the expert's actions high reward and the actor's actions relatively low reward.
5. Return to step 2: update the actor, then update the reward function, repeating until training ends.
The same idea as a GAN: the actor plays the role of the generator and the reward function that of the discriminator.
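The loop above can be sketched on a hypothetical 1-D task where the expert always walks toward state 10 and the learned reward is a single weight w on the state. The task, the greedy actor, and the update rule are all illustrative assumptions:

```python
# Hypothetical 1-D task: states 0..10; the expert always walks right.
def expert_trajectory():
    return list(range(0, 11))           # tau_hat: expert demonstration

def actor_trajectory(w):
    # Greedy actor derived from the current reward R(s) = w * s:
    # from state s, move to the neighbour with the higher reward.
    s, traj = 0, [0]
    for _ in range(10):
        s = s + 1 if w * (s + 1) > w * max(s - 1, 0) else max(s - 1, 0)
        traj.append(s)
    return traj

w, lr = -0.1, 0.01   # 1. initialise the reward weight
for _ in range(50):
    tau = actor_trajectory(w)           # 2. actor interacts -> tau
    tau_hat = expert_trajectory()       # 3. expert demonstration -> tau_hat
    # 4. update the reward so expert states score higher than actor states
    #    (the discriminator step in the GAN analogy)
    w += lr * (sum(tau_hat) - sum(tau))
    # 5. the greedy actor re-derives its behaviour from the new reward
```

Once w turns positive, the greedy actor reproduces the expert's trajectory and the reward update stops, mirroring the generator/discriminator equilibrium.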

drl_v5.pdf