References:
CSDN: Pytorch模型训练(5) - Optimizer
CSDN: Pytorch —— 优化器 Optimizer(一)
CSDN: pytorch 中 torch.optim.Adam 方法的使用和参数的解释
This article summarizes the **Optimizer** in **Pytorch**. The Optimizer is a very important module in deep learning model training: it determines the direction, speed, and size of parameter updates, and a good optimizer algorithm with suitable hyperparameters makes the model converge both quickly and accurately.
This article does not discuss which Optimizer (or which settings) to use for which task; it only summarizes the Optimizer machinery in Pytorch.
1. What is an optimizer
A **Pytorch** optimizer manages and updates the values of the learnable parameters in a model so that the model's outputs get closer to the ground-truth labels. "Manages" means the optimizer holds and modifies the parameters; "updates" refers to the optimizer's update strategy. The update strategy is usually gradient descent: the gradient is a vector whose direction is the direction in which the directional derivative is largest (steepest ascent), so the parameters are moved against it.
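To make the update strategy concrete, here is a minimal hand-written gradient-descent step, a sketch that is not from the referenced posts; the toy tensors are invented for illustration, and this is exactly the kind of update an optimizer automates for every parameter.
import torch

# A toy "model": one learnable weight, fitting y = 2*x on a single sample.
w = torch.tensor([1.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([6.0])
lr = 0.1

loss = ((w * x - y) ** 2).mean()   # forward pass
loss.backward()                    # fills w.grad

with torch.no_grad():              # manual gradient-descent step
    w -= lr * w.grad               # what optim.SGD's step() does (momentum=0)
w.grad.zero_()                     # gradients are not cleared automatically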
2. torch.optim
Pytorch's **torch.optim** is a package implementing various optimization algorithms. The most commonly used methods are already supported, the interface is general enough, and more sophisticated algorithms can be integrated easily as well.
How do you use an **Optimizer**?
To use an Optimizer, you first construct an Optimizer object. This object holds the current state and updates the parameters based on the computed gradients.
2.1 Creating an Optimizer
To construct an Optimizer, you have to give it an iterable containing the parameters to optimize. You can then specify optimizer-specific options such as the learning rate, weight decay, and so on.
Example:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
Note 1: if you need to move the model to the GPU, do so via ".cuda()" before constructing the optimizer, so that the optimizer holds references to the parameters that actually live on the GPU:
model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Note 2: during training, keep the model and the optimized parameters in consistent locations when the optimizer is constructed and used; that is, if the model is on the GPU, the parameters the optimizer manages should be on the GPU as well.
2.2 Optimizer per-parameter options
An Optimizer also supports per-parameter options. To use them, instead of passing an iterable of tensors (Variables), pass in an iterable of dicts. Each dict defines a separate parameter group and must contain a "params" key holding the list of parameters that belongs to that group; the other keys are group-specific optimization options.
This is very useful when different layers need different settings, e.g.:
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
That is, model.classifier.parameters() uses a learning rate of 1e-3, model.base.parameters() uses the default learning rate of 1e-2, and a momentum of 0.9 applies to all parameters, as the sketch below verifies.
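A small hedged sketch for checking the per-group settings; the two-part Net class is a hypothetical stand-in for a model with base and classifier submodules.
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(10, 5)        # stand-in for a backbone
        self.classifier = nn.Linear(5, 2)   # stand-in for a classification head

model = Net()
optimizer = optim.SGD([
    {'params': model.base.parameters()},                   # inherits the default lr=1e-2
    {'params': model.classifier.parameters(), 'lr': 1e-3}  # overrides lr for this group
], lr=1e-2, momentum=0.9)

for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])   # 0 0.01 0.9 / 1 0.001 0.9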
2.3 Taking an optimization step
To update the parameters, there are generally the following two ways to take an optimization step:
Method 1:
optimizer.step()
This simplified form covers most needs: once the gradients have been computed, e.g. by calling backward(), call step() to update the parameters.
Example:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
Method 2:
optimizer.step(closure)
Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the function multiple times, so you have to pass in a closure that lets the optimizer recompute the model: the closure should clear the gradients, compute the loss, and return it.
Example:
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
3. The Optimizer base class
torch.optim.Optimizer(params, defaults)
3.1 Optimizer arguments
**params**: an iterable of the objects to optimize, typically tensors (torch.Tensor) or dicts (parameter groups).
**defaults**: a dict of optimization options, most of which have default values, applied to every parameter group.
3.2 Optimizer attributes
class Optimizer(object):
    def __init__(self, params, defaults):
        self.defaults = defaults
        self.state = defaultdict(dict)
        self.param_groups = []   # filled with dicts of the form {'params': [...], 'lr': ..., ...}
**defaults**: the optimizer's hyperparameters (default options).
**state**: per-parameter state buffers, e.g. the momentum buffers.
**param_groups**: the parameter groups being managed.
**_step_count**: counts the number of updates; used by learning-rate schedulers.
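A quick hedged sketch to make these attributes concrete; a single linear layer stands in for a real model here.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

print(optimizer.defaults)       # {'lr': 0.01, 'momentum': 0.9, 'dampening': 0, ...}
print(optimizer.param_groups)   # one group holding the weight and bias plus the group options
print(optimizer.state)          # empty until step() has been called

out = model(torch.randn(3, 4)).sum()
out.backward()
optimizer.step()
print(optimizer.state)          # now holds a momentum buffer for each parameter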
3.3 Optimizer methods
1. zero_grad()
class Optimizer(object):
    def zero_grad(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()
zero_grad(): clears the gradients of all parameters managed by the optimizer. (A Pytorch characteristic: gradient tensors accumulate and are not cleared automatically.) See the small demonstration below.
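A minimal sketch showing why zero_grad() matters: gradients accumulate across backward() calls until they are cleared. The tiny one-weight model is invented for illustration.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(1, 1, bias=False)
optimizer = optim.SGD(model.parameters(), lr=0.1)
x = torch.ones(1, 1)

model(x).sum().backward()
print(model.weight.grad)        # gradient from the first backward()

model(x).sum().backward()
print(model.weight.grad)        # doubled: the second backward() accumulated into .grad

optimizer.zero_grad()
print(model.weight.grad)        # cleared (zeros, or None on versions where set_to_none is the default)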
2. step(closure)
class Optimizer(object):
    def step(self, closure=None):
        # left to each concrete optimizer (SGD, Adam, ...) to implement;
        # closure optionally re-evaluates the model and returns the loss
        raise NotImplementedError
step(): performs a single parameter update (one optimization step); a tiny numeric check follows below.
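As a hedged sketch, one manual check of what a single step() does for plain SGD (momentum=0): the parameter moves by lr * grad. The scalar example is made up for illustration.
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

loss = (w * 3.0).sum()   # d(loss)/dw = 3
loss.backward()
optimizer.step()
print(w)                 # tensor([0.7000], ...) i.e. 1.0 - 0.1 * 3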
3. add_param_group(param_group)
class Optimizer(object):
    def add_param_group(self, param_group):
        # collect the parameters already managed, so duplicates can be rejected
        param_set = set()
        for group in self.param_groups:
            param_set.update(set(group['params']))
        # (the real implementation validates param_group against param_set here)
        self.param_groups.append(param_group)
add_param_group(): adds a parameter group, i.e. adds new parameters to be optimized to **param_groups**. This is useful, for example, when fine-tuning a pretrained model: layers that were frozen at first can be added to training later, as shown in the sketch below.
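A hedged sketch of that fine-tuning scenario (the backbone/head layout is hypothetical): start by optimizing only the head, then unfreeze the backbone and register its parameters with the same optimizer.
import torch.nn as nn
import torch.optim as optim

# Hypothetical setup: a pretrained backbone that starts out frozen, plus a new head.
backbone = nn.Sequential(nn.Linear(10, 5), nn.ReLU())
head = nn.Linear(5, 2)
for p in backbone.parameters():
    p.requires_grad = False

# Optimize only the head at first.
optimizer = optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)
print(len(optimizer.param_groups))   # 1

# Later: unfreeze the backbone and hand its parameters to the same optimizer.
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': backbone.parameters(), 'lr': 1e-4})
print(len(optimizer.param_groups))   # 2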
4. state_dict()
state_dict(): returns the optimizer's current state as a dict. The returned dict has two entries: the optimizer state ('state') and the parameter groups ('param_groups').
5. load_state_dict(state_dict)
class Optimizer(object):
    def state_dict(self):
        # a snapshot of the optimizer state and parameter groups
        return {'state': packed_state, 'param_groups': param_groups}
    def load_state_dict(self, state_dict):
        # restores 'state' and 'param_groups' from a previously saved state_dict
        ...
load_state_dict(): loads a state dict, restoring the optimizer's state (e.g. when resuming training from a checkpoint); see the sketch below.
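A hedged sketch of how these two methods are typically used together when checkpointing; the model and the file name are arbitrary placeholders.
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# save model and optimizer together
torch.save({'model': model.state_dict(),
            'optimizer': optimizer.state_dict()}, 'checkpoint.pth')

# ... later, when resuming training ...
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])

print(optimizer.state_dict().keys())   # dict_keys(['state', 'param_groups'])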
4. The optimizers in torch.optim
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
I list these optimizers only briefly: at the application level the differences come down to their parameters, and those parameters are tied to the underlying algorithms, which are not the focus of this article. If you are interested, see a blog post on the theory of gradient-descent algorithms.
5. Learning-rate scheduling
These optimizers usually expose several parameters that must work together to optimize well, but most of them have sensible defaults that have been validated extensively, so when training a model there are not many parameters we have to set by hand.
The one that most often needs manual tuning is the learning rate; for the theory of learning-rate decay, see my separate blog post. How is it done in Pytorch?
torch.optim.lr_scheduler provides methods for adjusting the learning rate based on the number of epochs, mainly the following:
torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
Example:
from torch.optim.lr_scheduler import ReduceLROnPlateau

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
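As another minimal sketch (not from the original), StepLR multiplies the learning rate by gamma every step_size epochs; unlike ReduceLROnPlateau, its step() takes no metric. The training loop body is elided, and the model is a placeholder layer.
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... train for one epoch, calling optimizer.step() per batch ...
    scheduler.step()   # once per epoch, after the optimizer updates

print(optimizer.param_groups[0]['lr'])   # roughly 1e-4 after 90 epochs (0.1 * 0.1**3)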
6. torch.optim.Adam
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
**params (iterable)**: an iterable of parameters to optimize, or dicts defining parameter groups.
**lr (float, optional)**: learning rate (default: 1e-3).
**betas (Tuple[float, float], optional)**: coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999)).
**eps (float, optional)**: a term added to the denominator to improve numerical stability (default: 1e-8).
**weight_decay (float, optional)**: weight decay (L2 penalty) (default: 0).
My own notes:
**lr**: also called the learning rate or step size; it controls the rate at which the weights are updated (e.g. 0.001). A larger value (e.g. 0.3) gives faster initial learning before the learning rate is decayed, while a smaller value (e.g. 1e-5) lets training converge to better performance.
**betas = (beta1, beta2)**
**beta1**: the exponential decay rate of the first-moment estimates (e.g. 0.9).
**beta2**: the exponential decay rate of the second-moment estimates (e.g. 0.999). For problems with sparse gradients (e.g. NLP or computer-vision tasks), this hyperparameter should be set close to 1.
**eps**: epsilon, a very small number that prevents division by zero in the implementation (e.g. 1e-8).
Combine the parameter descriptions in the official documentation with the notes above to get a good grasp of how to use this optimizer.
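To tie these parameters together, here is a simplified, hedged sketch of one Adam update for a single parameter tensor (no amsgrad). The helper name adam_update is made up, but the arithmetic mirrors what F.adam performs in the source below: beta1/beta2 drive the moment averages, eps guards the denominator, and weight_decay adds the L2 term.
import torch

def adam_update(p, grad, exp_avg, exp_avg_sq, step,
                lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0):
    """One simplified Adam step for a single parameter tensor (no amsgrad)."""
    if weight_decay != 0:
        grad = grad + weight_decay * p                                # L2 penalty
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)                   # first moment (uses beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)      # second moment (uses beta2)
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_correction2).sqrt() + eps              # eps avoids division by zero
    p -= lr * (exp_avg / bias_correction1) / denom
    return p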
Source code:
import torch
from . import _functional as F
from .optimizer import Optimizer


class Adam(Optimizer):
    r"""Implements Adam algorithm.

    It has been proposed in `Adam: A Method for Stochastic Optimization`_.
    The implementation of the L2 penalty follows changes proposed in
    `Decoupled Weight Decay Regularization`_.

    Args:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper `On the Convergence of Adam and Beyond`_
            (default: False)

    .. _Adam\: A Method for Stochastic Optimization:
        https://arxiv.org/abs/1412.6980
    .. _Decoupled Weight Decay Regularization:
        https://arxiv.org/abs/1711.05101
    .. _On the Convergence of Adam and Beyond:
        https://openreview.net/forum?id=ryQu7f-RZ
    """

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, amsgrad=False):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        if not 0.0 <= weight_decay:
            raise ValueError("Invalid weight_decay value: {}".format(weight_decay))
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        weight_decay=weight_decay, amsgrad=amsgrad)
        super(Adam, self).__init__(params, defaults)

    def __setstate__(self, state):
        super(Adam, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('amsgrad', False)

    @torch.no_grad()
    def step(self, closure=None):
        """Performs a single optimization step.

        Args:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            params_with_grad = []
            grads = []
            exp_avgs = []
            exp_avg_sqs = []
            max_exp_avg_sqs = []
            state_steps = []
            beta1, beta2 = group['betas']

            for p in group['params']:
                if p.grad is not None:
                    params_with_grad.append(p)
                    if p.grad.is_sparse:
                        raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
                    grads.append(p.grad)

                    state = self.state[p]
                    # Lazy state initialization
                    if len(state) == 0:
                        state['step'] = 0
                        # Exponential moving average of gradient values
                        state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                        # Exponential moving average of squared gradient values
                        state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                        if group['amsgrad']:
                            # Maintains max of all exp. moving avg. of sq. grad. values
                            state['max_exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

                    exp_avgs.append(state['exp_avg'])
                    exp_avg_sqs.append(state['exp_avg_sq'])

                    if group['amsgrad']:
                        max_exp_avg_sqs.append(state['max_exp_avg_sq'])

                    # update the steps for each param group update
                    state['step'] += 1
                    # record the step after step update
                    state_steps.append(state['step'])

            F.adam(params_with_grad,
                   grads,
                   exp_avgs,
                   exp_avg_sqs,
                   max_exp_avg_sqs,
                   state_steps,
                   amsgrad=group['amsgrad'],
                   beta1=beta1,
                   beta2=beta2,
                   lr=group['lr'],
                   weight_decay=group['weight_decay'],
                   eps=group['eps'])

        return loss
7. The Optimizer in CPN
Instantiate the Adam optimizer:
optimizer = torch.optim.Adam(model.parameters(),
                             lr=cfg.lr,
                             weight_decay=cfg.weight_decay)
If resuming (args.resume is set), load the optimizer state along with the model weights:
if args.resume:
    if isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        pretrained_dict = checkpoint['state_dict']
        model.load_state_dict(pretrained_dict)
        args.start_epoch = checkpoint['epoch']
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
        logger = Logger(join(args.checkpoint, 'log.txt'), resume=True)
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
else:
    logger = Logger(join(args.checkpoint, 'log.txt'))
    logger.set_names(['Epoch', 'LR', 'Train Loss'])
During training (learning-rate adjustment, train, optimizer step, model saving):
for epoch in range(args.start_epoch, args.epochs):
    # adjust the learning rate
    lr = adjust_learning_rate(optimizer, epoch, cfg.lr_dec_epoch, cfg.lr_gamma)
    print('\nEpoch: %d | LR: %.8f' % (epoch + 1, lr))

    # train for one epoch; the optimizer is passed into the train function
    train_loss = train(train_loader, model, [criterion1, criterion2], optimizer)
    print('train_loss: ', train_loss)

    # append logger file
    logger.append([epoch + 1, lr, train_loss])

    # save the model, together with the optimizer state
    save_model({
        'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }, checkpoint=args.checkpoint)
Detail 1: the learning-rate adjustment function **adjust_learning_rate**
def adjust_learning_rate(optimizer, epoch, schedule, gamma):
    """Sets the learning rate to the initial LR decayed by schedule"""
    if epoch in schedule:
        for param_group in optimizer.param_groups:
            param_group['lr'] *= gamma
    return optimizer.state_dict()['param_groups'][0]['lr']
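A quick hedged check of how this function behaves; the schedule and values below are made up for illustration. The learning rate is multiplied by gamma at each epoch listed in the schedule and left unchanged otherwise.
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.5, momentum=0.9)

for epoch in range(5):
    lr = adjust_learning_rate(optimizer, epoch, schedule=[2, 4], gamma=0.1)
    print(epoch, lr)   # 0.5, 0.5, 0.05, 0.05, 0.005 (approximately)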
Detail 2: inside the train function, the optimizer iteration:
# compute gradient and do optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()