Training
On-policy: the actor being trained and the actor interacting with the environment are the same.
1. Initialise the actor's parameters.
2. Run a number of episodes and collect (s, a, r) data (state, action, reward).
3. Use this data to define how good particular actions are in particular states (the exact definition is in the PDF).
4. Train the actor on these judgements: for example, if action a is judged good in state s, train the actor to output a with higher probability in that state.
5. After training, obtain the updated actor parameters.
6. Return to step 2 and train again, repeating until training is complete.
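The loop above can be sketched as a minimal on-policy (REINFORCE-style) update on a hypothetical one-state toy problem, where action 1 gives reward 1 and action 0 gives reward 0. The environment, numbers, and variable names here are illustrative assumptions, not from the PDF:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

theta = np.zeros(2)              # step 1: initialise actor parameters
lr = 0.5

for _ in range(50):              # step 6: repeat until done
    probs = softmax(theta)
    # step 2: run episodes and collect (s, a, r) data
    data = []
    for _ in range(20):
        a = rng.choice(2, p=probs)
        data.append((a, 1.0 if a == 1 else 0.0))
    # steps 3-4: score actions by reward (minus a baseline) and push the
    # actor towards actions that scored well (policy gradient)
    baseline = np.mean([r for _, r in data])
    grad = sum((r - baseline) * (np.eye(2)[a] - probs) for a, r in data)
    theta += lr * grad / len(data)   # step 5: updated actor parameters

print(softmax(theta))            # probability mass shifts to action 1
```

After 50 iterations the actor assigns most of its probability to the rewarded action, which is exactly the "make good actions more probable" idea in step 4.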
Off-policy: the actor being trained and the actor interacting with the environment are different.
The actor being trained has to account for how it differs from the actor doing the interacting.
Example: PPO (covered later).
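One standard way PPO accounts for that difference is its clipped surrogate objective, which limits how far the trained policy's probability ratio can drift from the interacting (old) policy. The numbers below are made up for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """ratio = pi_theta(a|s) / pi_theta_old(a|s); clip keeps updates small."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage)

# If the new policy drifts too far (ratio 1.5), the gain is clipped to 1.2:
print(ppo_clip_objective(1.5, advantage=1.0))
```

The clip means the objective stops rewarding the trained actor for moving further than epsilon away from the actor that collected the data.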
Critic
A critic does not directly determine the action.
Given an actor, it evaluates how good that actor is.
Example: the value function V^θ(s): for an actor with parameters θ, the expected discounted cumulated reward when the state is s.
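In standard notation (a sketch assuming discount factor γ and per-step rewards r_t, not taken from the PDF), this reads:

```latex
V^{\theta}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]
```

where the expectation is over trajectories produced by the actor with parameters θ.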
How to estimate V^θ(s)? (Two methods; details in the PDF.)
Monte-Carlo (MC)
Temporal-Difference (TD)
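The two estimators can be contrasted on a toy two-state chain (a hypothetical example, not from the PDF): s0 → s1 → terminal, reward 1 on each step, discount 0.9. MC averages full observed returns; TD(0) bootstraps from the estimate of the next state:

```python
gamma = 0.9
episodes = [[("s0", 1.0), ("s1", 1.0)]] * 100  # (state, reward) pairs

# Monte-Carlo: average the full discounted return seen from each state
V_mc = {"s0": 0.0, "s1": 0.0}
counts = {"s0": 0, "s1": 0}
for ep in episodes:
    G = 0.0
    for state, r in reversed(ep):
        G = r + gamma * G                       # discounted return from here
        counts[state] += 1
        V_mc[state] += (G - V_mc[state]) / counts[state]

# TD(0): move the estimate towards r + gamma * V(next state)
V_td = {"s0": 0.0, "s1": 0.0, "end": 0.0}
alpha = 0.1
for ep in episodes:
    for i, (state, r) in enumerate(ep):
        nxt = ep[i + 1][0] if i + 1 < len(ep) else "end"
        V_td[state] += alpha * (r + gamma * V_td[nxt] - V_td[state])

print(V_mc["s0"], V_td["s0"])  # both approach 1 + 0.9 * 1 = 1.9
```

On this deterministic chain both converge to the same value; in general MC waits for the whole episode while TD updates after every step.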
Application
Advantage Actor-Critic
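The "advantage" that gives the method its name is commonly formed from a one-step TD error using the critic's value estimates (a standard formulation; the exact definition used in the course is in the PDF):

```python
def advantage(r, v_s, v_next, gamma=0.9):
    """How much better taking this action was than the critic expected:
    one-step TD error r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

print(advantage(r=1.0, v_s=1.5, v_next=1.0))  # ≈ 0.4
```

A positive advantage means the action did better than the critic's baseline, so the actor's probability for it is pushed up; a negative one pushes it down.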
Tips for Actor-Critic
See the PDF.
DQN
See the PDF link.
Reward Shaping
Imitation Learning
No reward function is available.
Only demonstrations from an expert (usually a human) are available.
Inverse RL
Infer the reward function from the expert demonstrations described above.
1. Initialise the actor.
2. Let the actor interact with the environment to produce a trajectory of actions and states.
3. The expert likewise produces a trajectory.
4. Define a reward function that is expected to give the expert's actions high reward and the actor's actions comparatively low reward.
5. Return to step 2: update the actor against the learned reward, update the reward function, and repeat until training ends.
The same idea as a GAN: the actor plays the generator and the learned reward function plays the discriminator.
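The GAN-style loop above can be sketched on a hypothetical one-state, two-action toy problem where the expert always picks action 1. Everything here (environment, learning rates, tabular parameterisation) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

actor_logits = np.zeros(2)   # step 1: initialise the actor (the "generator")
reward = np.zeros(2)         # learnable per-action reward (the "discriminator")
lr = 0.3

for _ in range(200):
    # steps 2-3: actor and expert each produce an action
    actor_a = rng.choice(2, p=softmax(actor_logits))
    expert_a = 1
    # step 4: push reward up for the expert's action, down for the actor's
    reward[expert_a] += lr
    reward[actor_a] -= lr
    # step 5: update the actor to prefer high-reward actions, then repeat
    actor_logits += lr * reward

print(softmax(actor_logits)[1])  # the actor ends up imitating the expert
```

Once the actor matches the expert, the two reward updates cancel, mirroring the GAN equilibrium where the discriminator can no longer separate the two.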
