References:
CSDN: PyTorch Model Training (5) - Optimizer
CSDN: PyTorch — Optimizer (Part 1)
CSDN: Usage and parameters of torch.optim.Adam in PyTorch
This article summarizes the **Optimizer** in **PyTorch**. The optimizer is a crucial component of deep learning training: it determines the direction, speed, and size of parameter updates, and a good optimization algorithm with well-chosen hyperparameters lets a model converge quickly and accurately.
This article does not discuss which optimizer (or which hyperparameters) to use for which task; it only summarizes the optimizers available in PyTorch.
1. What Is an Optimizer
A **PyTorch** optimizer manages and updates the learnable parameters of a model so that the model's output gets closer to the ground-truth labels. "Manages" means the optimizer owns and modifies the parameters; "updates" refers to its optimization strategy. The strategy is usually gradient descent: the gradient is a vector whose direction is the one in which the directional derivative is largest, so moving against it decreases the loss fastest.
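As a minimal illustration (my own toy sketch, not from the referenced posts), one gradient-descent step can be written by hand before handing the job over to an optimizer:

```python
import torch

# A toy sketch of one gradient-descent step: w <- w - lr * dL/dw
w = torch.tensor([2.0], requires_grad=True)   # learnable parameter
lr = 0.1

loss = (w - 1.0) ** 2        # simple quadratic loss, minimum at w = 1
loss.backward()              # populates w.grad with dL/dw = 2 * (w - 1) = 2.0

with torch.no_grad():        # the update itself must not be tracked by autograd
    w -= lr * w.grad         # w becomes 2.0 - 0.1 * 2.0 = 1.8
w.grad.zero_()               # clear the gradient for the next step
```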
2. torch.optim
PyTorch's **torch.optim** is a package implementing various optimization algorithms. It supports the most commonly used methods behind a general-purpose interface, which also makes it easy to integrate more sophisticated algorithms.
How do you use an **Optimizer**?
To use an Optimizer, you first construct an Optimizer object. It holds the current optimization state and updates the parameters based on the computed gradients.
2.1 Creating an Optimizer
To construct an Optimizer, you give it an iterable containing the parameters to optimize, and you can additionally specify optimizer-specific options such as the learning rate, weight decay, and so on.
Example:
```python
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
```
Note 1: If the model needs to run on the GPU, move it there with `.cuda()` before constructing the optimizer (the optimizer object itself has no `.cuda()` method):
```python
model.cuda()  # move the model's parameters to the GPU first
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
Note 2: During training, keep the model and the optimizer consistent: if the model's parameters live on the GPU, the optimizer must be constructed from, and update, those GPU parameters.
2.2 Per-parameter Options (Parameter Groups)
Optimizers also support per-parameter options. To use them, pass an iterable of dicts instead of an iterable of Variables/Tensors. Each dict defines a separate parameter group and must contain a `params` key holding the list of parameters that belong to it; the other keys override the optimizer-level options for that group.
This is very useful when different layers should use different settings, e.g.:
```python
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
```
In other words, `model.classifier`'s parameters use a learning rate of 1e-3, `model.base`'s parameters use the default learning rate of 1e-2, and a momentum of 0.9 applies to all parameters.
2.3 Taking an Optimization Step
There are generally two ways to update the parameters:
Method 1:
optimizer.step()
This form covers most use cases: once the gradients have been computed, e.g. by calling `backward()`, calling `step()` performs the parameter update.
Example:
```python
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```
Method 2:
optimizer.step(closure)
Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so you must pass in a closure that lets them recompute the model: the closure should clear the gradients, compute the loss, and return it.
Example:
```python
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
```
3. The Optimizer Base Class
torch.optim.Optimizer(params, defaults)
3.1 Optimizer Arguments
- **params**: an iterable of the objects to be optimized, usually `torch.Tensor`s, or dicts defining parameter groups.
- **defaults**: a dict of optimization options, most of which have default values.
3.2 Optimizer Attributes
```python
class Optimizer(object):
    def __init__(self, params, defaults):
        self.defaults = defaults            # optimizer hyperparameters
        self.state = defaultdict(dict)      # per-parameter buffers
        self.param_groups = []              # list of dicts, e.g. [{'params': [...], 'lr': ...}]
```
- **defaults**: the optimizer's hyperparameters.
- **state**: per-parameter buffers, e.g. the momentum buffers.
- **param_groups**: the parameter groups being managed.
- **_step_count**: the number of updates performed; used by learning-rate schedulers.
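A quick sketch (my own toy example) to see these attributes on a real optimizer:

```python
import torch
from torch import optim

w = torch.randn(3, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults)      # {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, ...}
print(optimizer.param_groups)  # one dict per group, each holding 'params' plus its options
print(optimizer.state)         # per-parameter buffers; empty until the first step()
```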
3.3 Optimizer 的方法
1. zero_grad()
```python
class Optimizer(object):
    def zero_grad(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()
```
zero_grad(): clears the gradients of every parameter managed by the optimizer. (A PyTorch peculiarity: gradients accumulate across backward() calls and are not cleared automatically.)
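A small sketch (my own example) showing why this matters: gradients accumulate until they are explicitly cleared.

```python
import torch

w = torch.ones(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

(2 * w).backward()
print(w.grad)      # tensor([2.])
(2 * w).backward()
print(w.grad)      # tensor([4.])  <- accumulated, not overwritten
opt.zero_grad()
print(w.grad)      # cleared to zero (newer PyTorch versions may set it to None instead)
```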
2. step(closure)
```python
class Optimizer(object):
    def step(self, closure=None):
        r"""Performs a single optimization step (parameter update)."""
        raise NotImplementedError   # implemented by each concrete optimizer (SGD, Adam, ...)
```
step(): performs a single optimization step (parameter update);
3. add_param_group(param_group)
```python
class Optimizer(object):
    def add_param_group(self, param_group):
        # (simplified) collect existing parameters to check the new group does not overlap
        param_set = set()
        for group in self.param_groups:
            param_set.update(set(group['params']))
        # ... validation omitted ...
        self.param_groups.append(param_group)
```
add_param_group(): adds a group of parameters to **param_groups**, i.e. registers additional parameters to optimize. This is useful, for instance, when fine-tuning a pre-trained model: layers that were frozen at the start can later be added to the optimizer and included in training, as in the sketch below.
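A hedged sketch of that fine-tuning scenario (the attribute names `model.fc` and `model.backbone` are placeholders of my own, not from the original post):

```python
# Stage 1: optimize only the classifier head while the backbone stays frozen
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# Stage 2: unfreeze the backbone and add it as a new parameter group with a smaller lr
for p in model.backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model.backbone.parameters(), 'lr': 1e-4})
```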
4. state_dict()
state_dict(): returns the optimizer's current state as a dict.
The returned dict contains two entries: the optimizer state and the parameter groups.
5. load_state_dict(state_dict)
```python
class Optimizer(object):
    def state_dict(self):
        ...
        return {
            'state': packed_state,
            'param_groups': param_groups,
        }

    def load_state_dict(self, state_dict):
        ...
```
load_state_dict(): loads an optimizer state dict, restoring the optimizer's parameters and buffers, e.g. when resuming training; see the sketch below.
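A minimal sketch of the usual save/restore round trip (my own example; the path `checkpoint.pth` is an assumption):

```python
# Save model and optimizer state together in one checkpoint
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

# Later: restore both before resuming training
ckpt = torch.load('checkpoint.pth')
model.load_state_dict(ckpt['model'])
optimizer.load_state_dict(ckpt['optimizer'])
```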
4. Built-in Optimizers
```python
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
```
I list these optimizers only briefly: at the application level what matters is their parameters, and those parameters are tied to the underlying algorithms, which are not the focus of this article. If you are interested, see a blog post on the principles of gradient-descent algorithms.
5. Learning Rate Scheduling
These optimizers usually take several parameters that work together to achieve good optimization, but most of them have sensible default values that have been validated extensively, so in practice only a few need to be set by hand.
The one that most often needs manual tuning is the learning rate. The theory of learning-rate decay is covered in a separate blog post; how is it done in PyTorch?
torch.optim.lr_scheduler
provides several ways to adjust the learning rate based on the number of epochs; the main ones are:
```python
torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
```
Example:
```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
```
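For comparison, a sketch with `StepLR` (my own example, mirroring the loop above): the learning rate is multiplied by 0.1 every 30 epochs, and `scheduler.step()` is called once per epoch after the optimizer updates.

```python
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
for epoch in range(90):
    train(...)                                   # optimizer.step() happens inside train()
    scheduler.step()                             # advance the decay schedule once per epoch
    print(epoch, optimizer.param_groups[0]['lr'])
```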
6. torch.optim.Adam
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
- **params (iterable)**: an iterable of parameters to optimize, or dicts defining parameter groups.
- **lr (float, optional)**: learning rate (default: 1e-3).
- **betas (Tuple[float, float], optional)**: coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).
- **eps (float, optional)**: term added to the denominator to improve numerical stability (default: 1e-8).
- **weight_decay (float, optional)**: weight decay (L2 penalty) (default: 0).
My own interpretation:
- **lr**: also called the learning rate or step size; it controls the rate at which the weights are updated (e.g. 0.001). Larger values (e.g. 0.3) give faster initial learning, while smaller values (e.g. 1e-5) let training converge to better final performance.
- **betas = (beta1, beta2)**:
  - **beta1**: exponential decay rate for the first-moment estimates (e.g. 0.9).
  - **beta2**: exponential decay rate for the second-moment estimates (e.g. 0.999). For problems with sparse gradients (e.g. in NLP or computer-vision tasks), this should be set close to 1.
- **eps**: epsilon, a very small number used to prevent division by zero in the implementation (e.g. 1e-8).
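For reference, the Adam update rule in its standard form from the paper (added here by me so that the role of each hyperparameter is visible); $g_t$ is the gradient at step $t$ and $\theta$ the parameters:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\
\theta_t &= \theta_{t-1} - \mathrm{lr} \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$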
Combining the parameter descriptions in the official documentation with these notes should give a good grasp of how to use this optimizer.
Source code:
```python
import torch
from . import _functional as F
from .optimizer import Optimizer


class Adam(Optimizer):
    r"""Implements Adam algorithm.

    It has been proposed in `Adam: A Method for Stochastic Optimization`_.
    The implementation of the L2 penalty follows changes proposed in
    `Decoupled Weight Decay Regularization`_.

    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_
            (default: False)

    .. _Adam\: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
    .. _Decoupled Weight Decay Regularization:
        https://arxiv.org/abs/1711.05101
    .. _On the Convergence of Adam and Beyond:
        https://openreview.net/forum?id=ryQu7f-RZ
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        if not 0.0 <= weight_decay:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, amsgrad=amsgrad)
        super(Adam, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Adam, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.

        Args:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            params_with_grad = []
            grads = []
            exp_avgs = []
            exp_avg_sqs = []
            max_exp_avg_sqs = []
            state_steps = []
            beta1, beta2 = group['betas']

            for p in group['params']:
                if p.grad is not None:
                    params_with_grad.append(p)
                    if p.grad.is_sparse:
                        raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
                    grads.append(p.grad)

                    state = self.state[p]
                    # Lazy state initialization
                    if len(state) == 0:
                        state['step'] = 0
                        # Exponential moving average of gradient values
                        state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                        # Exponential moving average of squared gradient values
                        state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                        if group['amsgrad']:
                            # Maintains max of all exp. moving avg. of sq. grad. values
                            state['max_exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                    exp_avgs.append(state['exp_avg'])
                    exp_avg_sqs.append(state['exp_avg_sq'])

                    if group['amsgrad']:
                        max_exp_avg_sqs.append(state['max_exp_avg_sq'])

                    # update the steps for each param group update
                    state['step'] += 1
                    # record the step after step update
                    state_steps.append(state['step'])

            F.adam(params_with_grad,
                   grads,
                   exp_avgs,
                   exp_avg_sqs,
                   max_exp_avg_sqs,
                   state_steps,
                   amsgrad=group['amsgrad'],
                   beta1=beta1,
                   beta2=beta2,
                   lr=group['lr'],
                   weight_decay=group['weight_decay'],
                   eps=group['eps'])
        return loss
```
7. CPN Optimizer
Instantiation
```python
# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(),
                             lr=cfg.lr,
                             weight_decay=cfg.weight_decay)
```
If resuming, load the optimizer state:
```python
if args.resume:
    if isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        pretrained_dict = checkpoint['state_dict']
        model.load_state_dict(pretrained_dict)
        args.start_epoch = checkpoint['epoch']
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})".format(args.resume, checkpoint['epoch']))
        logger = Logger(join(args.checkpoint, 'log.txt'), resume=True)
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
else:
    logger = Logger(join(args.checkpoint, 'log.txt'))
    logger.set_names(['Epoch', 'LR', 'Train Loss'])
```
During training (learning-rate adjustment, train, optimizer step, model saving):
```python
for epoch in range(args.start_epoch, args.epochs):
    # adjust the learning rate
    lr = adjust_learning_rate(optimizer, epoch, cfg.lr_dec_epoch, cfg.lr_gamma)
    print('\nEpoch: %d | LR: %.8f' % (epoch + 1, lr))

    # train for one epoch; the optimizer is passed to the train function
    train_loss = train(train_loader, model, [criterion1, criterion2], optimizer)
    print('train_loss: ', train_loss)

    # append logger file
    logger.append([epoch + 1, lr, train_loss])

    # save the model, including the optimizer state
    save_model({
        'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, checkpoint=args.checkpoint)
```
Detail 1: the learning-rate adjustment function
**adjust_learning_rate**:
```python
def adjust_learning_rate(optimizer, epoch, schedule, gamma):
    """Sets the learning rate to the initial LR decayed by schedule"""
    if epoch in schedule:
        for param_group in optimizer.param_groups:
            param_group['lr'] *= gamma
    return optimizer.state_dict()['param_groups'][0]['lr']
```
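Roughly the same schedule could be expressed with the built-in scheduler (a sketch of mine, assuming `cfg.lr_dec_epoch` is a list of epoch indices):

```python
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=cfg.lr_dec_epoch,
                                                 gamma=cfg.lr_gamma)
# then call scheduler.step() once per epoch instead of adjust_learning_rate()
```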
Detail 2: the optimizer step inside the train function:
```python
# compute gradients and do the optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
