• Stable Baselines / User Guide / Examples


• Try it Online with Colab Notebooks

All the following examples can be executed online using Google Colab notebooks.

• Basic Usage: Training, Saving, Loading

In the following example, we will train, save and then load a DQN model on the Lunar Lander environment.

Try it in a Google Colab Notebook.


LunarLander requires the box2d Python package. You can install it with `apt install swig` followed by `pip install box2d box2d-kengz`.

Each call to the load function re-creates the model from scratch, which can be slow. If you need to evaluate the same model with several different sets of parameters, consider using load_parameters instead (see the sketch after the example below).

```python
import gym

from stable_baselines import DQN

# Create environment
env = gym.make('LunarLander-v2')

# Instantiate the agent
model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=1)
# Train the agent
model.learn(total_timesteps=int(2e5))
# Save the agent
model.save("dqn_lunar")
del model  # delete trained model to demonstrate loading

# Load the trained agent
model = DQN.load("dqn_lunar")

# Enjoy trained agent
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```
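
Following the note above, here is a minimal sketch of evaluating several saved parameter sets without rebuilding the model each time. The save-file names are hypothetical; they stand for files previously produced by model.save on models with this architecture.

```python
# Keep a single model instance and only swap its weights:
# load_parameters copies the saved values into the existing graph,
# avoiding the full graph reconstruction that DQN.load performs.
model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=0)
for params_file in ["dqn_lunar_run1", "dqn_lunar_run2"]:  # hypothetical save files
    model.load_parameters(params_file)
    obs, total_reward = env.reset(), 0.0
    for _ in range(1000):
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    print(params_file, total_reward)
```
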
• Multiprocessing: Unleashing the Power of Vectorized Environments

Try it in a Google Colab Notebook.


```python
import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines import ACKTR

def make_env(env_id, rank, seed=0):
    """
    Utility function for multiprocessed env.

    :param env_id: (str) the environment ID
    :param rank: (int) index of the subprocess
    :param seed: (int) the initial seed for RNG
    """
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    set_global_seeds(seed)
    return _init

env_id = "CartPole-v1"
num_cpu = 4  # Number of processes to use
# Create the vectorized environment
env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])

model = ACKTR(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)

obs = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```
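
As a side note, SubprocVecEnv pays a per-step communication cost between processes, so for very cheap environments such as CartPole a single-process DummyVecEnv can end up faster. A minimal sketch of the drop-in swap, reusing the make_env helper above:

```python
from stable_baselines.common.vec_env import DummyVecEnv

# All environments run sequentially in the main process:
# no inter-process communication overhead per step.
env = DummyVecEnv([make_env(env_id, i) for i in range(num_cpu)])
model = ACKTR(MlpPolicy, env, verbose=1)
```
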
• Using Callback: Monitoring Training

You can define a custom callback function that will be called inside the agent. It is useful for monitoring training, for instance to display a live learning curve in Tensorboard (or Visdom) or to save the best agent. If your callback returns False, training is aborted early.

Try it in a Google Colab Notebook.

[Figure: learning curve of DDPG on the LunarLanderContinuous environment]

```python
import os

import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines import DDPG
from stable_baselines.ddpg import AdaptiveParamNoiseSpec

best_mean_reward, n_steps = -np.inf, 0

def callback(_locals, _globals):
    """
    Callback called at each step (for DQN and others) or after n steps (see ACER or PPO2)
    :param _locals: (dict)
    :param _globals: (dict)
    """
    global n_steps, best_mean_reward
    # Print stats every 1000 calls
    if (n_steps + 1) % 1000 == 0:
        # Evaluate policy training performance
        x, y = ts2xy(load_results(log_dir), 'timesteps')
        if len(x) > 0:
            mean_reward = np.mean(y[-100:])
            print(x[-1], 'timesteps')
            print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))

            # New best model, you could save the agent here
            if mean_reward > best_mean_reward:
                best_mean_reward = mean_reward
                # Example for saving best model
                print("Saving new best model")
                _locals['self'].save(log_dir + 'best_model.pkl')
    n_steps += 1
    return True

# Create log dir
log_dir = "/tmp/gym/"
os.makedirs(log_dir, exist_ok=True)

# Create and wrap the environment
env = gym.make('LunarLanderContinuous-v2')
env = Monitor(env, log_dir, allow_early_resets=True)

# Add some param noise for exploration
param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)
# Because we use parameter noise, we should use a MlpPolicy with layer normalization
model = DDPG(LnMlpPolicy, env, param_noise=param_noise, verbose=0)
# Train the agent
model.learn(total_timesteps=int(1e5), callback=callback)
```
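
matplotlib and the results_plotter helpers are imported above but not used further in the snippet; a minimal sketch of plotting the learning curve from the Monitor logs once training has finished:

```python
# Plot episode reward against timesteps using the Monitor files in log_dir
x, y = ts2xy(load_results(log_dir), 'timesteps')
plt.plot(x, y)
plt.xlabel('Timesteps')
plt.ylabel('Episode reward')
plt.title('DDPG on LunarLanderContinuous-v2')
plt.show()
```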

• Atari Games

![](https://github.com/DBWangML/stable-baselines-zh/blob/master/%E7%94%A8%E6%88%B7%E5%90%91%E5%AF%BC/%E5%9B%BE%E7%89%87/A2C.gif)

Trained A2C agent on Breakout

![](https://github.com/DBWangML/stable-baselines-zh/blob/master/%E7%94%A8%E6%88%B7%E5%90%91%E5%AF%BC/%E5%9B%BE%E7%89%87/Pong.gif)

The Pong environment

Training an RL agent on Atari games is straightforward thanks to the make_atari_env helper function. It will do all the preprocessing and multiprocessing for you.

> Try it in a [Google Colab Notebook](https://colab.research.google.com/drive/1iYK11yDzOOqnrXi1Sfjm1iekZr4cxLaN)

```python
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
from stable_baselines import ACER

# There already exists an environment generator
# that will make and wrap atari environments correctly.
# Here we are also multiprocessing training (num_env=4 => 4 processes)
env = make_atari_env('PongNoFrameskip-v4', num_env=4, seed=0)
# Frame-stacking with 4 frames
env = VecFrameStack(env, n_stack=4)

model = ACER('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=25000)

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```

• Mujoco: Normalizing Input Features

Normalizing input features may be essential for successfully training an RL agent (by default, images are scaled but other types of input are not), for instance when training on Mujoco. For this, a wrapper exists that computes a running mean and standard deviation of the input features (it can do the same for rewards).

We cannot provide a notebook for this example, because Mujoco is a proprietary engine and requires a license.

```python
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines import PPO2

env = DummyVecEnv([lambda: gym.make("Reacher-v2")])
# Automatically normalize the input features
env = VecNormalize(env, norm_obs=True, norm_reward=False,
                   clip_obs=10.)

model = PPO2(MlpPolicy, env)
model.learn(total_timesteps=2000)

# Don't forget to save the running average when saving the agent
log_dir = "/tmp/"
model.save(log_dir + "ppo_reacher")
env.save_running_average(log_dir)
```
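
When the agent is loaded back, the saved statistics need to be restored too. A minimal sketch, assuming the load_running_average counterpart of save_running_average available in this Stable Baselines version:

```python
# Re-create the wrapped environment and restore the normalization statistics
env = DummyVecEnv([lambda: gym.make("Reacher-v2")])
env = VecNormalize(env, norm_obs=True, norm_reward=False, clip_obs=10.)
env.load_running_average(log_dir)

model = PPO2.load(log_dir + "ppo_reacher", env=env)
```
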
• Custom Policy Network

Stable Baselines provides default policy networks for images (CNN policies) and other types of inputs (MLP policies). However, you can also easily define a custom architecture for the policy network (see the Custom Policy section for details):

```python
import gym

from stable_baselines.common.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

# Custom MLP policy of three layers of size 128 each
class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])],
                                           feature_extraction="mlp")

model = A2C(CustomPolicy, 'LunarLander-v2', verbose=1)
# Train the agent
model.learn(total_timesteps=100000)
```
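
As a side note, a custom policy class can also be registered under a name so that it can be passed to the constructor as a string; a minimal sketch using register_policy from stable_baselines.common.policies:

```python
from stable_baselines.common.policies import register_policy

# Once registered, the policy can be referred to by name
register_policy('CustomPolicy', CustomPolicy)
model = A2C(policy='CustomPolicy', env='LunarLander-v2', verbose=1)
```
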
• Accessing and Modifying Model Parameters

You can access model parameters via the load_parameters and get_parameters functions, which work with dictionaries that map variable names to NumPy arrays.

These functions are useful when you need to evaluate a large number of models with the same network structure, visualize different layers of the network, or modify parameters manually.

You can access the raw Tensorflow variables with get_parameter_list.
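
For instance, a minimal sketch of inspecting the underlying tf.Variable objects, assuming a trained model such as the one built in the example below:

```python
# get_parameter_list() returns the raw TensorFlow variables,
# in the same order as the arrays returned by get_parameters()
for tf_var in model.get_parameter_list():
    print(tf_var.name, tf_var.shape)
```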

The following example demonstrates reading parameters, modifying some of them, and loading them back into the model, by implementing an evolution strategy to solve the CartPole-v1 environment. The initial guess for the parameters is obtained by running A2C policy-gradient updates on the model.

```python
import gym
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import A2C

def mutate(params):
    """Mutate parameters by adding normal noise to them"""
    return dict((name, param + np.random.normal(size=param.shape))
                for name, param in params.items())

def evaluate(env, model):
    """Return mean fitness (sum of episodic rewards) for given model"""
    episode_rewards = []
    for _ in range(10):
        reward_sum = 0
        done = False
        obs = env.reset()
        while not done:
            action, _states = model.predict(obs)
            obs, reward, done, info = env.step(action)
            reward_sum += reward
        episode_rewards.append(reward_sum)
    return np.mean(episode_rewards)

# Create env
env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])
# Create policy with a small network
model = A2C(MlpPolicy, env, ent_coef=0.0, learning_rate=0.1,
            policy_kwargs={'net_arch': [8, ]})

# Use traditional actor-critic policy gradient updates to
# find good initial parameters
model.learn(total_timesteps=5000)

# Get the parameters as the starting point for ES
mean_params = model.get_parameters()

# Include only variables with "/pi/" (policy) or "/shared" (shared layers)
# in their name: Only these ones affect the action.
mean_params = dict((key, value) for key, value in mean_params.items()
                   if ("/pi/" in key or "/shared" in key))

for iteration in range(10):
    # Create population of candidates and evaluate them
    population = []
    for population_i in range(100):
        candidate = mutate(mean_params)
        # Load new policy parameters to agent.
        # Tell function that it should only update parameters
        # we give it (policy parameters)
        model.load_parameters(candidate, exact_match=False)
        fitness = evaluate(env, model)
        population.append((candidate, fitness))
    # Take top 10% and use average over their parameters as next mean parameter
    top_candidates = sorted(population, key=lambda x: x[1], reverse=True)[:10]
    mean_params = dict(
        (name, np.stack([top_candidate[0][name] for top_candidate in top_candidates]).mean(0))
        for name in mean_params.keys()
    )
    mean_fitness = sum(top_candidate[1] for top_candidate in top_candidates) / 10.0
    print("Iteration {:<3} Mean top fitness: {:.2f}".format(iteration, mean_fitness))
```

• Recurrent Policies

This example shows how to train and test a recurrent policy.

A current limitation of recurrent policies is that you must test them with the same number of environments they were trained on.

```python
from stable_baselines import PPO2

# For recurrent policies, with PPO2, the number of environments run in parallel
# should be a multiple of nminibatches.
model = PPO2('MlpLstmPolicy', 'CartPole-v1', nminibatches=1, verbose=1)
model.learn(50000)

# Retrieve the env
env = model.get_env()

obs = env.reset()
# Passing state=None to the predict function means
# it is the initial state
state = None
# When using VecEnv, done is a vector
done = [False for _ in range(env.num_envs)]
for _ in range(1000):
    # We need to pass the previous state and a mask for recurrent policies
    # to reset lstm state when a new episode begin
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, _ = env.step(action)
    # Note: with VecEnv, env.reset() is automatically called

    # Show the env
    env.render()
```

• Hindsight Experience Replay (HER)

For this example, we are using Highway-Env by @eleurent.

Try it in a Google Colab Notebook.


The parking env is a goal-conditioned continuous control task, in which the vehicle must park inside a given space.

The hyperparameters below were optimized for that environment.

```python
import gym
import highway_env
import numpy as np

from stable_baselines import HER, SAC, DDPG, TD3
from stable_baselines.ddpg import NormalActionNoise

env = gym.make("parking-v0")

# Create 4 artificial transitions per real transition
n_sampled_goal = 4

# SAC hyperparams:
model = HER('MlpPolicy', env, SAC, n_sampled_goal=n_sampled_goal,
            goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(layers=[256, 256, 256]))

# DDPG Hyperparams:
# NOTE: it works even without action noise
# n_actions = env.action_space.shape[0]
# noise_std = 0.2
# action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=noise_std * np.ones(n_actions))
# model = HER('MlpPolicy', env, DDPG, n_sampled_goal=n_sampled_goal,
#             goal_selection_strategy='future',
#             verbose=1, buffer_size=int(1e6),
#             actor_lr=1e-3, critic_lr=1e-3, action_noise=action_noise,
#             gamma=0.95, batch_size=256,
#             policy_kwargs=dict(layers=[256, 256, 256]))

model.learn(int(2e5))
model.save('her_sac_highway')

# Load saved model
model = HER.load('her_sac_highway', env=env)

obs = env.reset()

# Evaluate the agent
episode_reward = 0
for _ in range(100):
    action, _ = model.predict(obs)
    obs, reward, done, info = env.step(action)
    env.render()
    episode_reward += reward
    if done or info.get('is_success', False):
        print("Reward:", episode_reward, "Success?", info.get('is_success', False))
        episode_reward = 0.0
        obs = env.reset()
```
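
TD3 is imported in the example but never used; a minimal sketch of the analogous TD3 configuration, carrying over the hyperparameters from the DDPG block above (not specifically tuned for TD3):

```python
# TD3 hyperparams (sketch): reuse the Gaussian action noise defined for DDPG
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = HER('MlpPolicy', env, TD3, n_sampled_goal=n_sampled_goal,
            goal_selection_strategy='future',
            verbose=1, buffer_size=int(1e6),
            learning_rate=1e-3, action_noise=action_noise,
            gamma=0.95, batch_size=256,
            policy_kwargs=dict(layers=[256, 256, 256]))
```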

• Continual Learning

You can also move from learning on one environment to another for continual learning (PPO2 on DemonAttack-v0, then transferred to SpaceInvaders-v0):

```python
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines import PPO2

# There already exists an environment generator
# that will make and wrap atari environments correctly
env = make_atari_env('DemonAttackNoFrameskip-v4', num_env=8, seed=0)

model = PPO2('CnnPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

# The number of environments must be identical when changing environments
env = make_atari_env('SpaceInvadersNoFrameskip-v4', num_env=8, seed=0)

# change env
model.set_env(env)
model.learn(total_timesteps=10000)

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
```

• Record a Video

Record an mp4 video (here using a random agent).

This example requires ffmpeg or avconv to be installed.

```python
import gym
from stable_baselines.common.vec_env import VecVideoRecorder, DummyVecEnv

env_id = 'CartPole-v1'
video_folder = 'logs/videos/'
video_length = 100

env = DummyVecEnv([lambda: gym.make(env_id)])

obs = env.reset()

# Record the video starting at the first step
env = VecVideoRecorder(env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="random-agent-{}".format(env_id))

env.reset()
for _ in range(video_length + 1):
    action = [env.action_space.sample()]
    obs, _, _, _ = env.step(action)
env.close()
```
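
The same wrapper also works for a trained agent; a minimal sketch that records a quickly trained PPO2 model instead of random actions (the model and its training budget are illustrative):

```python
from stable_baselines import PPO2

# Train a small model on the same environment, then record it
model = PPO2('MlpPolicy', env_id, verbose=0).learn(10000)

env = VecVideoRecorder(DummyVecEnv([lambda: gym.make(env_id)]), video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="trained-agent-{}".format(env_id))
obs = env.reset()
for _ in range(video_length + 1):
    action, _states = model.predict(obs)
    obs, _, _, _ = env.step(action)
env.close()
```
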
• Bonus: Make a GIF of a Trained Agent

For Atari games, you need to use a screen recorder such as Kazam, and then convert the video using ffmpeg.

```python
import imageio
import numpy as np

from stable_baselines.common.policies import MlpPolicy
from stable_baselines import A2C

model = A2C(MlpPolicy, "LunarLander-v2").learn(100000)

images = []
obs = model.env.reset()
img = model.env.render(mode='rgb_array')
for i in range(350):
    images.append(img)
    action, _ = model.predict(obs)
    obs, _, _, _ = model.env.step(action)
    img = model.env.render(mode='rgb_array')

imageio.mimsave('lander_a2c.gif', [np.array(img[0]) for i, img in enumerate(images) if i % 2 == 0], fps=29)
```