These notes cover the torch.optim module: the Optimizer classes used to train models, the learning-rate scheduling strategies (lr_scheduler), and the SWA-related utilities. All source code referenced here is from torch==1.7.0. The notes were adapted from an MMLab article on Zhihu.

  1. Optimizer
  2. Learning-rate schedulers: lr_scheduler
  3. Stochastic Weight Averaging: swa_utils
  4. References

1. Optimizer

1.0 Basic usage

  • Optimizers update the model's learnable parameters during training; commonly used optimizers include SGD, RMSprop and Adam
  • An optimizer is initialized with the model's learnable parameters together with other hyperparameters such as lr and momentum
  • During training, first call optimizer.zero_grad() to clear the gradients, then loss.backward() to backpropagate, and finally optimizer.step() to update the model parameters
  • The state_dict() and load_state_dict() methods can be used to resume training after an interruption

A simple usage example:

    import torch
    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')  # ignore warnings

    x = torch.linspace(-np.pi, np.pi, 2000)
    y = torch.sin(x)
    p = torch.tensor([1, 2, 3])
    xx = x.unsqueeze(-1).pow(p)

    model = torch.nn.Sequential(
        torch.nn.Linear(3, 1),
        torch.nn.Flatten(0, 1)
    )
    loss_fn = torch.nn.MSELoss(reduction='sum')

    learning_rate = 1e-3
    optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
    for t in range(1, 1001):
        y_pred = model(xx)
        loss = loss_fn(y_pred, y)
        if t % 100 == 0:
            print('No.{: 5d}, loss: {:.6f}'.format(t, loss.item()))
        optimizer.zero_grad()  # clear the gradients
        loss.backward()        # backpropagate to compute gradients
        optimizer.step()       # update the parameters

1.1 Optimizers in PyTorch


All optimizers inherit from the base class Optimizer. PyTorch provides the following optimizers:

  • SGD
  • ASGD
  • Adadelta
  • Adagrad
  • Adam
  • AdamW
  • Adamax
  • SparseAdam
  • RMSprop
  • Rprop
  • LBFGS

1.2 How the base class Optimizer works


Optimizer is the parent class of all optimizers. Its main public methods are:

  • add_param_group(param_group): adds a group of learnable parameters
  • step(closure): performs a single parameter update
  • zero_grad(): clears the gradients recorded in the previous iteration
  • state_dict(): returns the optimizer's parameters and state as a dict
  • load_state_dict(state_dict): loads parameters and state from a dict

1.2.1 Initializing an Optimizer


To initialize an optimizer, pass the model's learnable parameters (params) and the hyperparameters (defaults) to its constructor. The core logic of Optimizer's __init__ is sketched below.
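The following is a condensed paraphrase of the torch 1.7 constructor (docstrings and some type checks omitted), not a verbatim copy; see torch/optim/optimizer.py for the full source:

    import torch
    from collections import defaultdict

    class Optimizer(object):
        def __init__(self, params, defaults):
            self.defaults = defaults               # hyperparameters shared by all groups
            if isinstance(params, torch.Tensor):
                raise TypeError("params argument given to the optimizer should be "
                                "an iterable of Tensors or dicts")
            self.state = defaultdict(dict)         # per-parameter state (e.g. momentum buffers)
            self.param_groups = []

            param_groups = list(params)
            if len(param_groups) == 0:
                raise ValueError("optimizer got an empty parameter list")
            if not isinstance(param_groups[0], dict):
                # a plain iterable of Tensors becomes a single parameter group
                param_groups = [{'params': param_groups}]

            for param_group in param_groups:
                self.add_param_group(param_group)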

1.2.2 add_param_group

This method is called by the constructor; it adds a group of model parameters to self.param_groups.

    def add_param_group(self, param_group):
        r"""Add a param group to the :class:`Optimizer` s `param_groups`.

        This can be useful when fine tuning a pre-trained network as frozen layers can be made
        trainable and added to the :class:`Optimizer` as training progresses.

        Arguments:
            param_group (dict): Specifies what Tensors should be optimized along with group
                specific optimization options.
        """
        assert isinstance(param_group, dict), "param group must be a dict"

        params = param_group['params']
        if isinstance(params, torch.Tensor):
            param_group['params'] = [params]
        elif isinstance(params, set):
            raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                            'the ordering of tensors in sets will change between runs. Please use a list instead.')
        else:
            param_group['params'] = list(params)

        for param in param_group['params']:
            if not isinstance(param, torch.Tensor):
                raise TypeError("optimizer can only optimize Tensors, "
                                "but one of the params is " + torch.typename(param))
            if not param.is_leaf:
                raise ValueError("can't optimize a non-leaf Tensor")

        # fill in any hyperparameter this group does not specify from the defaults
        for name, default in self.defaults.items():
            if default is required and name not in param_group:
                raise ValueError("parameter group didn't specify a value of required optimization parameter " + name)
            else:
                param_group.setdefault(name, default)

        params = param_group['params']
        if len(params) != len(set(params)):
            warnings.warn("optimizer contains a parameter group with duplicate parameters; "
                          "in future, this will cause an error; "
                          "see github.com/pytorch/pytorch/issues/40967 for more information", stacklevel=3)

        param_set = set()
        for group in self.param_groups:
            param_set.update(set(group['params']))

        if not param_set.isdisjoint(set(param_group['params'])):
            raise ValueError("some parameters appear in more than one parameter group")

        self.param_groups.append(param_group)

With add_param_group, different groups of learnable parameters can be given different hyperparameters: the optimizer can be initialized with a list of dicts, where each dict contains a params key plus any per-group hyperparameters such as lr. Below is a practical example that sets a different learning rate for the model's fc layer:

    from torch.optim import SGD
    from torch import nn

    class DummyModel(nn.Module):
        def __init__(self, class_num=10):
            super(DummyModel, self).__init__()
            self.base = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            self.gap = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(128, class_num)

        def forward(self, x):
            x = self.base(x)
            x = self.gap(x)
            x = x.view(x.shape[0], -1)
            x = self.fc(x)
            return x

    model = DummyModel().cuda()

    optimizer = SGD([
        {'params': model.base.parameters()},
        {'params': model.fc.parameters(), 'lr': 1e-3}  # a different learning rate for the fc parameters
    ], lr=1e-2, momentum=0.9)
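To check what the two-group setup above produced, you can iterate over optimizer.param_groups (a quick illustrative check, not part of the original example):

    # each group carries its own hyperparameters; 'params' is omitted from the printout
    for i, group in enumerate(optimizer.param_groups):
        hparams = {k: v for k, v in group.items() if k != 'params'}
        print('group {}: {} tensors, {}'.format(i, len(group['params']), hparams))
    # expected: group 0 uses the default lr=1e-2, group 1 (fc) uses lr=1e-3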

1.2.3 step

This method performs a single update of the model parameters.

  • The base class Optimizer only defines the step() interface, shown below:

    def step(self, closure):
        r"""Performs a single optimization step (parameter update).

        Arguments:
            closure (callable): A closure that reevaluates the model and
                returns the loss. Optional for most optimizers.

        .. note::
            Unless otherwise specified, this function should not modify the
            ``.grad`` field of the parameters.
        """
        raise NotImplementedError


  • Subclasses such as SGD implement step(), as shown below:

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad
                if weight_decay != 0:
                    d_p = d_p.add(p, alpha=weight_decay)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf
                p.add_(d_p, alpha=-group['lr'])

        return loss

  • step() can take a closure, mainly to support optimization algorithms such as Conjugate Gradient and LBFGS that need to re-evaluate the model multiple times per update
  • A closure in Python is an inner function that references variables from its enclosing scope (and is usually returned by the outer function); see the small example below
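As a quick illustration of the closure concept itself (independent of PyTorch, names are arbitrary):

    def make_counter():
        count = 0
        def counter():              # inner function referencing `count` from the enclosing scope
            nonlocal count
            count += 1
            return count
        return counter              # the returned inner function is a closure

    tick = make_counter()
    print(tick(), tick(), tick())   # 1 2 3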

Here is a simple example of passing a closure to step():

    from torch.nn import CrossEntropyLoss

    dummy_model = DummyModel().cuda()
    optimizer = SGD(dummy_model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
    # define the loss
    loss_fn = CrossEntropyLoss()
    # fake data of shape 64 x 3 x 64 x 128 with random labels in [0, 10)
    batch_size = 2
    data = torch.randn(64, 3, 64, 128).cuda()
    data_label = torch.randint(0, 10, size=(64,), dtype=torch.long).cuda()

    for batch_index in range(10):
        batch_data = data[batch_index * batch_size: batch_index * batch_size + batch_size]
        batch_label = data_label[batch_index * batch_size: batch_index * batch_size + batch_size]

        def closure():
            optimizer.zero_grad()                # clear gradients
            output = dummy_model(batch_data)     # forward
            loss = loss_fn(output, batch_label)  # compute the loss
            loss.backward()                      # backward
            print('No.{: 2d} loss: {:.6f}'.format(batch_index, loss.item()))
            return loss

        optimizer.step(closure=closure)          # update the parameters


1.2.4 zero_grad

  • zero_grad() clears the gradients recorded in the previous iteration before the next backward pass. When set_to_none=True, the gradients are set to None instead of being zeroed, which lowers memory usage, but it is usually not recommended because PyTorch handles None gradients differently from zero gradients (a small check is shown after the code below).

    def zero_grad(self, set_to_none: bool = False):
        r"""Sets the gradients of all optimized :class:`torch.Tensor` s to zero.

        Arguments:
            set_to_none (bool): instead of setting to zero, set the grads to None.
                This will in general have lower memory footprint, and can modestly improve performance.
                However, it changes certain behaviors. For example:
                1. When the user tries to access a gradient and perform manual ops on it,
                a None attribute or a Tensor full of 0s will behave differently.
                2. If the user requests ``zero_grad(set_to_none=True)`` followed by a backward pass, ``.grad``s
                are guaranteed to be None for params that did not receive a gradient.
                3. ``torch.optim`` optimizers have a different behavior if the gradient is 0 or None
                (in one case it does the step with a gradient of 0 and in the other it skips
                the step altogether).
        """
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    if set_to_none:
                        p.grad = None
                    else:
                        if p.grad.grad_fn is not None:
                            p.grad.detach_()
                        else:
                            p.grad.requires_grad_(False)
                        p.grad.zero_()
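A small illustrative check of the difference between the two modes (the tensor and loss are arbitrary):

    w = torch.randn(3, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.1)

    w.sum().backward()
    opt.zero_grad()                  # default: the gradient becomes a tensor of zeros
    print(w.grad)                    # tensor([0., 0., 0.])

    w.sum().backward()
    opt.zero_grad(set_to_none=True)  # the gradient is dropped entirely
    print(w.grad)                    # None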

1.2.5 state_dict() and load_state_dict()


These two methods implement serialization and deserialization of the optimizer state.

  • state_dict(): returns the parameters managed by the optimizer and their state as a dict
  • load_state_dict(state_dict): loads a previously returned dict and restores the parameters and their state
  • Together they make it possible to resume training after an interruption (a checkpoint sketch follows the code below)

    def state_dict(self):
        r"""Returns the state of the optimizer as a :class:`dict`.

        It contains two entries:

        * state - a dict holding current optimization state. Its content
            differs between optimizer classes.
        * param_groups - a dict containing all parameter groups
        """
        # Save order indices instead of Tensors
        param_mappings = {}
        start_index = 0

        def pack_group(group):
            nonlocal start_index
            packed = {k: v for k, v in group.items() if k != 'params'}
            param_mappings.update({id(p): i for i, p in enumerate(group['params'], start_index)
                                   if id(p) not in param_mappings})
            packed['params'] = [param_mappings[id(p)] for p in group['params']]
            start_index += len(packed['params'])
            return packed

        param_groups = [pack_group(g) for g in self.param_groups]
        # Remap state to use order indices as keys
        packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
                        for k, v in self.state.items()}
        return {
            'state': packed_state,
            'param_groups': param_groups,
        }
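A minimal checkpoint sketch for resuming interrupted training (the file name, model and epoch variables are illustrative):

    # save, e.g. at the end of each epoch
    torch.save({'epoch': epoch,
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

    # resume later: rebuild model/optimizer first, then restore their states
    ckpt = torch.load('checkpoint.pth')
    model.load_state_dict(ckpt['model'])
    optimizer.load_state_dict(ckpt['optimizer'])
    start_epoch = ckpt['epoch'] + 1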


1.3 Common optimizers


1.3.1 SGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False)


Implements SGD with momentum and dampening. Following the step() code above (weight decay $\lambda$, momentum $\mu$, dampening $\tau$, learning rate $\mathrm{lr}$), the update is:

$$
\begin{aligned}
g_t &\leftarrow \nabla_\theta f(\theta_{t-1}) + \lambda\,\theta_{t-1} \\
b_t &\leftarrow \mu\, b_{t-1} + (1-\tau)\, g_t \\
d_t &\leftarrow g_t + \mu\, b_t \ \text{ (Nesterov)} \quad\text{or}\quad d_t \leftarrow b_t \\
\theta_t &\leftarrow \theta_{t-1} - \mathrm{lr}\cdot d_t
\end{aligned}
$$
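To make the formula concrete, the sketch below reproduces the SGD-with-momentum update by hand (no weight decay, dampening or Nesterov) and checks it against torch.optim.SGD; the toy tensor and loss are arbitrary:

    lr, momentum = 0.1, 0.9
    p = torch.tensor([1.0, -2.0], requires_grad=True)
    opt = torch.optim.SGD([p], lr=lr, momentum=momentum)

    buf = torch.zeros_like(p)                # momentum buffer b_0 = 0
    for _ in range(3):
        opt.zero_grad()
        loss = (p ** 2).sum()
        loss.backward()                      # g_t = 2 * p
        with torch.no_grad():
            buf = momentum * buf + p.grad    # b_t = mu * b_{t-1} + g_t
            expected = p - lr * buf          # theta_t = theta_{t-1} - lr * b_t
        opt.step()
        assert torch.allclose(p, expected)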

1.3.2 Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)

Adaptive learning rate that accumulates all past squared gradients; since the accumulator $G_t$ only grows, the effective step size shrinks over time:

$$
G_t = G_{t-1} + g_t^2, \qquad
\theta_t = \theta_{t-1} - \frac{\mathrm{lr}}{\sqrt{G_t} + \epsilon}\, g_t
$$

1.3.3 RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)


Maintains an exponential moving average of squared gradients: the history is weighted by $\alpha$ and the current squared gradient by $1-\alpha$:

$$
v_t = \alpha\, v_{t-1} + (1-\alpha)\, g_t^2, \qquad
\theta_t = \theta_{t-1} - \frac{\mathrm{lr}}{\sqrt{v_t} + \epsilon}\, g_t
$$

1.3.4 Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)


An adaptive learning-rate optimizer: Adam combines Momentum and RMSprop. Its main hyperparameters are $\beta_1$, $\beta_2$ and eps. In the formulas below, $m_t$ and $v_t$ are the first- and second-moment estimates of the gradient (approximations of $E[g_t]$ and $E[g_t^2]$), and $\hat{m}_t$, $\hat{v}_t$ are their bias-corrected versions, which make them approximately unbiased estimates of the expectations:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \mathrm{lr}\cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$
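For reference, the constructors listed in the headings above applied to a small model (the model and hyperparameter values here are arbitrary; each call consumes net.parameters() anew):

    from torch import nn

    net = nn.Linear(10, 2)  # any small model, just for illustration
    sgd     = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, nesterov=True)
    adagrad = torch.optim.Adagrad(net.parameters(), lr=1e-2, lr_decay=0, eps=1e-10)
    rmsprop = torch.optim.RMSprop(net.parameters(), lr=1e-2, alpha=0.99, eps=1e-8)
    adam    = torch.optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)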

2 Learning-rate schedulers: lr_scheduler


Besides an optimizer, the learning rate usually needs to be adjusted over the course of training. lr_scheduler provides the learning-rate adjustment strategies used while training a model.

PyTorch currently provides the following learning-rate schedulers:

  • StepLR: decay at fixed epoch intervals
  • MultiStepLR: decay at specified milestones
  • ExponentialLR: exponential decay
  • ReduceLROnPlateau: adaptive adjustment based on a monitored metric
  • CyclicLR: cyclical schedule
  • OneCycleLR: one-cycle schedule
  • CosineAnnealingLR: cosine annealing
  • CosineAnnealingWarmRestarts: cosine annealing with warm restarts
  • LambdaLR: user-defined function
  • MultiplicativeLR: multiplicative decay

These schedulers can be roughly divided into three categories:

  • Ordered (predefined) schedules:

    • StepLR
    • MultiStepLR
    • ExponentialLR
    • CyclicLR
    • OneCycleLR
    • CosineAnnealingLR
    • CosineAnnealingWarmRestarts
  • Adaptive schedules:

    • ReduceLROnPlateau
  • User-defined schedules:

    • LambdaLR
    • MultiplicativeLR


2.1 The base class: _LRScheduler

The main job of a learning-rate scheduler is to compute a new learning rate for each parameter group every epoch and write it into the corresponding lr entry of the optimizer's param_groups, so that it takes effect in the optimizer's subsequent parameter updates. All scheduler classes inherit from torch.optim.lr_scheduler._LRScheduler, which defines the following methods:

  • step(epoch=None): shared by all subclasses
  • get_lr(): must be implemented by each subclass
  • get_last_lr(): shared by all subclasses
  • print_lr(is_verbose, group, lr, epoch=None): prints learning-rate adjustment info
  • state_dict(): may be overridden by subclasses
  • load_state_dict(state_dict): may be overridden by subclasses

2.1.1 Initialization


The base class constructor takes two main arguments. The first, optimizer, is an instance of the Optimizer class discussed earlier. The second, last_epoch, is the index of the last epoch and defaults to -1, which means training starts from scratch; in that case each of the optimizer's parameter groups gets an initial_lr entry set to its current lr. If last_epoch is greater than -1, training is being resumed from a previous epoch, and every parameter group of the optimizer is then required to already contain an initial_lr entry. The with_counter helper inside the constructor exists to check that lr_scheduler.step() is called after optimizer.step() (this calling order changed in PyTorch 1.1). Note that the last line of __init__ calls self.step(), i.e. _LRScheduler already performs one step() during initialization. A typical usage pattern is sketched after the code below.

    class _LRScheduler(object):

        def __init__(self, optimizer, last_epoch=-1, verbose=False):

            # Attach optimizer
            if not isinstance(optimizer, Optimizer):
                raise TypeError('{} is not an Optimizer'.format(
                    type(optimizer).__name__))
            self.optimizer = optimizer

            # Initialize epoch and base learning rates
            if last_epoch == -1:
                for group in optimizer.param_groups:
                    group.setdefault('initial_lr', group['lr'])
            else:
                for i, group in enumerate(optimizer.param_groups):
                    if 'initial_lr' not in group:
                        raise KeyError("param 'initial_lr' is not specified "
                                       "in param_groups[{}] when resuming an optimizer".format(i))
            self.base_lrs = list(map(lambda group: group['initial_lr'], optimizer.param_groups))
            self.last_epoch = last_epoch

            # Following https://github.com/pytorch/pytorch/issues/20124
            # We would like to ensure that `lr_scheduler.step()` is called after
            # `optimizer.step()`
            def with_counter(method):
                if getattr(method, '_with_counter', False):
                    # `optimizer.step()` has already been replaced, return.
                    return method

                # Keep a weak reference to the optimizer instance to prevent
                # cyclic references.
                instance_ref = weakref.ref(method.__self__)
                # Get the unbound method for the same purpose.
                func = method.__func__
                cls = instance_ref().__class__
                del method

                @wraps(func)
                def wrapper(*args, **kwargs):
                    instance = instance_ref()
                    instance._step_count += 1
                    wrapped = func.__get__(instance, cls)
                    return wrapped(*args, **kwargs)

                # Note that the returned function here is no longer a bound method,
                # so attributes like `__func__` and `__self__` no longer exist.
                wrapper._with_counter = True
                return wrapper

            self.optimizer.step = with_counter(self.optimizer.step)
            self.optimizer._step_count = 0
            self._step_count = 0
            self.verbose = verbose

            self.step()
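A typical per-epoch usage pattern looks like the sketch below (train_one_epoch is a placeholder for your own training code); since PyTorch 1.1 the scheduler's step() should be called after optimizer.step():

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(100):
        train_one_epoch(model, optimizer)  # calls zero_grad()/backward()/optimizer.step() inside
        scheduler.step()                   # then adjust the learning rate once per epoch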

2.1.3 get_last_lr, get_lr and print_lr

  • get_last_lr() is straightforward: it returns the learning rates of the optimizer's parameter groups as recorded after the most recent step() call
  • get_lr() is an abstract method that defines the interface for computing new learning rates; each subclass provides its own implementation. It returns a list of the form [lr1, lr2, ...], one entry per parameter group (a minimal custom scheduler is sketched after the code below)
  • print_lr(is_verbose, group, lr, epoch=None): prints information about learning-rate adjustments

    def get_last_lr(self):
        """ Return last computed learning rate by current scheduler.
        """
        return self._last_lr

    def get_lr(self):
        # Compute learning rate using chainable form of the scheduler
        raise NotImplementedError

    def print_lr(self, is_verbose, group, lr, epoch=None):
        """Display the current learning rate.
        """
        if is_verbose:
            if epoch is None:
                print('Adjusting learning rate'
                      ' of group {} to {:.4e}.'.format(group, lr))
            else:
                print('Epoch {:5d}: adjusting learning rate'
                      ' of group {} to {:.4e}.'.format(epoch, group, lr))
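As an illustration of the get_lr() interface, here is a minimal custom scheduler that halves each group's learning rate every step_size epochs (StepLR with gamma=0.5 already does this; the class is only a sketch):

    from torch.optim.lr_scheduler import _LRScheduler

    class HalveLR(_LRScheduler):
        def __init__(self, optimizer, step_size, last_epoch=-1):
            self.step_size = step_size
            super(HalveLR, self).__init__(optimizer, last_epoch)

        def get_lr(self):
            # closed form: the lr depends only on base_lrs and the current epoch index
            return [base_lr * 0.5 ** (self.last_epoch // self.step_size)
                    for base_lr in self.base_lrs]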


2.1.4 state_dict and load_state_dict



These two methods serve the same purpose as their counterparts on Optimizer: saving and reloading state. Note that the state of the self.optimizer attribute is not duplicated here, because Optimizer implements its own versions of these methods (a combined checkpoint sketch follows the code below).

  • state_dict(): returns every attribute of the scheduler instance except self.optimizer as a dict
  • load_state_dict(state_dict): reloads a previously saved state

    def state_dict(self):
        """Returns the state of the scheduler as a :class:`dict`.

        It contains an entry for every variable in self.__dict__ which
        is not the optimizer.
        """
        return {key: value for key, value in self.__dict__.items() if key != 'optimizer'}

    def load_state_dict(self, state_dict):
        """Loads the schedulers state.

        Arguments:
            state_dict (dict): scheduler state. Should be an object returned
                from a call to :meth:`state_dict`.
        """
        self.__dict__.update(state_dict)
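Extending the optimizer checkpoint sketch from section 1.2.5, the scheduler state is usually saved and restored alongside the optimizer state (names are illustrative):

    torch.save({'optimizer': optimizer.state_dict(),
                'scheduler': scheduler.state_dict()}, 'checkpoint.pth')

    ckpt = torch.load('checkpoint.pth')
    optimizer.load_state_dict(ckpt['optimizer'])
    scheduler.load_state_dict(ckpt['scheduler'])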


2.2 Scheduler examples


2.2.1 StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)


StepLR decays the learning rate at fixed epoch intervals and implements get_lr(). Its constructor takes the optimizer, the epoch interval step_size, and the decay factor gamma (default 0.1).

    class StepLR(_LRScheduler):

        def __init__(self, optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False):
            self.step_size = step_size
            self.gamma = gamma
            super(StepLR, self).__init__(optimizer, last_epoch, verbose)

        def get_lr(self):
            if not self._get_lr_called_within_step:
                warnings.warn("To get the last learning rate computed by the scheduler, "
                              "please use `get_last_lr()`.", UserWarning)

            if (self.last_epoch == 0) or (self.last_epoch % self.step_size != 0):
                return [group['lr'] for group in self.optimizer.param_groups]
            return [group['lr'] * self.gamma
                    for group in self.optimizer.param_groups]

        def _get_closed_form_lr(self):
            return [base_lr * self.gamma ** (self.last_epoch // self.step_size)
                    for base_lr in self.base_lrs]


    ## visualize the learning rate
    from torch.optim import lr_scheduler
    from matplotlib import pyplot as plt
    %matplotlib inline

    def create_optimizer():
        return SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

    def plot_lr(scheduler, title='', labels=['base'], nrof_epoch=100):
        lr_li = [[] for _ in range(len(labels))]
        epoch_li = list(range(nrof_epoch))
        for epoch in epoch_li:
            scheduler.step()              # compute and apply this epoch's learning rate
            lr = scheduler.get_last_lr()  # read back the learning rate for this epoch
            for i in range(len(labels)):
                lr_li[i].append(lr[i])
        for lr, label in zip(lr_li, labels):
            plt.plot(epoch_li, lr, label=label)
        plt.grid()
        plt.xlabel('epoch')
        plt.ylabel('lr')
        plt.title(title)
        plt.legend()
        plt.show()

    ## visualize StepLR
    optimizer = create_optimizer()
    scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    plot_lr(scheduler, title='StepLR')

(Figure: lr-vs-epoch curve produced by plot_lr for StepLR)

2.2.2 MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1, verbose=False)


Multi-stage decay: the milestones argument is a list of epoch indices at which the learning rate is decayed.

    optimizer = create_optimizer()
    scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[20, 35, 45], gamma=0.5)
    plot_lr(scheduler, title='MultiStepLR')

(Figure: lr-vs-epoch curve produced by plot_lr for MultiStepLR)

2.2.3 MultiplicativeLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)


The multiplicative schedule allows a variable decay factor gamma: at each adjustment point, each parameter group's learning rate can be multiplied by its own factor. The lr_lambda argument can be a single lambda function or a list of lambda functions; each takes the epoch as input and returns the factor gamma.

    optimizer = SGD([
        {'params': model.base.parameters()},
        {'params': model.fc.parameters(), 'lr': 0.05}  # a different learning rate for the fc parameters
    ], lr=0.1, momentum=0.9)

    lambda_base = lambda epoch: 0.5 if epoch % 10 == 0 else 1
    lambda_fc = lambda epoch: 0.8 if epoch % 10 == 0 else 1
    scheduler = lr_scheduler.MultiplicativeLR(optimizer, [lambda_base, lambda_fc])
    plot_lr(scheduler, title='MultiplicativeLR', labels=['base', 'fc'])


(Figure: lr-vs-epoch curves produced by plot_lr for MultiplicativeLR, base and fc groups)

2.2.4 LambdaLR(optimizer, lr_lambda, last_epoch=-1, verbose=False)


This scheduler takes a user-defined lambda function whose argument is the epoch; each group's learning rate is its initial lr multiplied by the lambda's return value.

    # LambdaLR usage example
    def lambda_foo(epoch):
        if epoch < 10:
            return (epoch + 1) * 1e-3
        elif epoch < 40:
            return 1e-2
        else:
            return 1e-3

    optimizer = create_optimizer()
    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda_foo)
    plot_lr(scheduler, title='LambdaLR')

(Figure: lr-vs-epoch curve produced by plot_lr for LambdaLR)

2.2.5 ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)


Exponential decay of the learning rate.

    optimizer = create_optimizer()
    scheduler = lr_scheduler.ExponentialLR(optimizer, gamma=0.9)
    plot_lr(scheduler, title='ExponentialLR')

(Figure: lr-vs-epoch curve produced by plot_lr for ExponentialLR)

2.2.6 CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)


Cosine annealing schedule. T_max is the maximum number of iterations and eta_min is the minimum learning rate. In the formula below, $\eta_{\max}$ is the initial learning rate and $T_{cur}$ is the number of epochs since the last restart:

$$
\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}}\pi\right)\right)
$$

    optimizer = create_optimizer()
    scheduler = lr_scheduler.CosineAnnealingLR(optimizer, 10, 1e-5)
    plot_lr(scheduler, title='CosineAnnealingLR')

(Figure: lr-vs-epoch curve produced by plot_lr for CosineAnnealingLR)

2.2.7 CosineAnnealingWarmRestarts(optimizer, T_0, T_mult=1, eta_min=0, last_epoch=-1, verbose=False)


Proposed in SGDR (Stochastic Gradient Descent with Warm Restarts):

  • T_0: number of iterations until the first restart
  • T_mult: factor by which the restart period T grows after each restart
  • eta_min: lower bound on the learning rate

    optimizer = create_optimizer()
    scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 10, 2)
    plot_lr(scheduler, title='CosineAnnealingWarmRestarts')

(Figure: lr-vs-epoch curve produced by plot_lr for CosineAnnealingWarmRestarts)

2.2.8 CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', …)


A triangular-wave-shaped schedule. Important constructor arguments:

  • base_lr: base learning rate, which is also the minimum learning rate
  • max_lr: upper bound on the learning rate
  • step_size_up: number of epochs in the increasing half of a cycle
  • step_size_down: number of epochs in the decreasing half of a cycle

    optimizer = create_optimizer()
    scheduler = lr_scheduler.CyclicLR(optimizer, 0.01, 0.1, step_size_up=25, step_size_down=10)
    plot_lr(scheduler, title='CyclicLR')

(Figure: lr-vs-epoch curve produced by plot_lr for CyclicLR)

2.2.9 OneCycleLR(optimizer, max_lr, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3,…)


A schedule with a single cycle:

  • max_lr: float or list, upper bound on the learning rate
  • total_steps: int, total number of steps in the cycle

    optimizer = create_optimizer()
    scheduler = lr_scheduler.OneCycleLR(optimizer, 0.1, total_steps=100)
    plot_lr(scheduler, title='OneCycleLR')

2.2.10 ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', …)


An adaptive schedule: for example, the learning rate is reduced only when the loss has stopped decreasing for several epochs. Note that the monitored quantity must be passed to step(), e.g. scheduler.step(val_loss).

  • mode: whether the monitored quantity should be minimized ('min') or maximized ('max')
  • factor: decay factor for the learning rate, analogous to gamma
  • patience: number of epochs without improvement to wait before reducing the learning rate

Example:

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, 'min')
    for epoch in range(100):
        train(...)
        val_loss = validate(...)
        scheduler.step(val_loss)

3 SWA-related classes and functions in swa_utils


This module contains only two classes and one function:

  • AveragedModel: the weight-averaged model used by the SWA algorithm
  • SWALR: a learning-rate schedule designed to be used together with AveragedModel
  • update_bn: recomputes a model's BatchNorm statistics

3.0 A brief introduction to SWA


Stochastic Weight Averaging (SWA) is an optimization technique. The SWA paper shows that taking a simple average of multiple points along the SGD trajectory, using a cyclical or constant learning rate, generalizes better than conventional training. The paper also shows that SWA finds wider optima.

3.1 AveragedModel

  • This class implements the weight-averaged model of the SWA algorithm. It is initialized with a model and an averaging function avg_fn; the constructor makes a deep copy of the model and registers a counter (n_averaged) as a buffer.
  • Each time the current model is passed to update_parameters(self, model), the SWA model's parameters are updated by averaging them with the model's parameters according to avg_fn (a minimal usage sketch follows the code below).
    class AveragedModel(Module):
        def __init__(self, model, device=None, avg_fn=None):
            super(AveragedModel, self).__init__()
            self.module = deepcopy(model)
            if device is not None:
                self.module = self.module.to(device)
            self.register_buffer('n_averaged',
                                 torch.tensor(0, dtype=torch.long, device=device))
            # a default avg_fn (equal-weight running average) is used unless you pass your own
            if avg_fn is None:
                def avg_fn(averaged_model_parameter, model_parameter, num_averaged):
                    return averaged_model_parameter + \
                        (model_parameter - averaged_model_parameter) / (num_averaged + 1)
            self.avg_fn = avg_fn

        def forward(self, *args, **kwargs):
            return self.module(*args, **kwargs)

        def update_parameters(self, model):
            for p_swa, p_model in zip(self.parameters(), model.parameters()):
                device = p_swa.device
                p_model_ = p_model.detach().to(device)
                if self.n_averaged == 0:
                    p_swa.detach().copy_(p_model_)
                else:
                    p_swa.detach().copy_(self.avg_fn(p_swa.detach(), p_model_,
                                                     self.n_averaged.to(device)))
            self.n_averaged += 1
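A minimal usage sketch (train_one_epoch, num_epochs and swa_start are placeholders); the full workflow including SWALR and update_bn is shown in section 3.3:

    swa_model = torch.optim.swa_utils.AveragedModel(model)
    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)        # ordinary training of the underlying model
        if epoch >= swa_start:
            swa_model.update_parameters(model)   # fold the current weights into the running average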

3.2 update_bn


This function takes a model from some point in training (typically the SWA model) and a dataloader, and recomputes and updates the model's BatchNorm statistics by running one pass over the data.

    def update_bn(loader, model, device=None):
        momenta = {}
        for module in model.modules():
            if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
                module.running_mean = torch.zeros_like(module.running_mean)
                module.running_var = torch.ones_like(module.running_var)
                momenta[module] = module.momentum

        if not momenta:
            return

        was_training = model.training
        model.train()
        for module in momenta.keys():
            module.momentum = None
            module.num_batches_tracked *= 0

        # recompute the global BN running mean and variance over the loader
        for input in loader:
            if isinstance(input, (list, tuple)):
                input = input[0]
            if device is not None:
                input = input.to(device)
            model(input)

        for bn_module in momenta.keys():
            bn_module.momentum = momenta[bn_module]
        model.train(was_training)

Example:

    loader, model = ...
    torch.optim.swa_utils.update_bn(loader, model)

3.3 SWALR


The SWALR class inherits from the _LRScheduler base class and implements the learning-rate schedule used during the SWA phase of training.

Only its usage example is shown here:

    Example:
        >>> loader, optimizer, model = ...
        >>> swa_model = torch.optim.swa_utils.AveragedModel(model)
        >>> lr_lambda = lambda epoch: 0.9
        >>> scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer,
        >>>     lr_lambda=lr_lambda)
        >>> swa_scheduler = torch.optim.swa_utils.SWALR(optimizer,
        >>>     anneal_strategy="linear", anneal_epochs=20, swa_lr=0.05)
        >>> swa_start = 160
        >>> for i in range(300):
        >>>     for input, target in loader:
        >>>         optimizer.zero_grad()
        >>>         loss_fn(model(input), target).backward()
        >>>         optimizer.step()
        >>>     if i > swa_start:
        >>>         swa_model.update_parameters(model)  # fold the current weights into the SWA average
        >>>         swa_scheduler.step()
        >>>     else:
        >>>         scheduler.step()
        >>> # Update bn statistics for the swa_model at the end
        >>> torch.optim.swa_utils.update_bn(loader, swa_model)