PyTorch
The code in this article targets PyTorch 1.0 and uses the following packages:

    import collections
    import os
    import shutil
    import tqdm
    import numpy as np
    import PIL.Image
    import torch
    import torchvision

Basic Configuration

Check the PyTorch version

    torch.__version__               # PyTorch version
    torch.version.cuda              # Corresponding CUDA version
    torch.backends.cudnn.version()  # Corresponding cuDNN version
    torch.cuda.get_device_name(0)   # GPU type

Update PyTorch

PyTorch is installed under the anaconda3/lib/python3.7/site-packages/torch/ directory.

    conda update pytorch torchvision -c pytorch

Fix the random seed

    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)
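Full reproducibility typically also requires seeding Python's and NumPy's generators; a minimal sketch (the helper name seed_everything is illustrative, not a PyTorch API):

```python
import random

import numpy as np
import torch


def seed_everything(seed):
    # Illustrative helper: seed every RNG the training pipeline may touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # No-op on CPU-only machines.


# Re-seeding reproduces the same random tensor.
seed_everything(0)
a = torch.rand(3)
seed_everything(0)
b = torch.rand(3)
assert torch.equal(a, b)
```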

Run the program on specific GPUs

Set the environment variable on the command line

    CUDA_VISIBLE_DEVICES=0,1 python train.py

or set it in code

    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

Check whether CUDA is available

    torch.cuda.is_available()

Enable cuDNN benchmark mode

Benchmark mode speeds up computation, but because the algorithm selection involves nondeterminism, forward results may differ slightly between runs.

    torch.backends.cudnn.benchmark = True

To avoid this fluctuation, set

    torch.backends.cudnn.deterministic = True

Clear GPU memory

Sometimes GPU memory is not released promptly after interrupting a run with Control-C and must be cleared manually. From within PyTorch you can call

    torch.cuda.empty_cache()

On the command line, use ps to find the program's PID, then kill the process

    ps aux | grep python
    kill -9 [pid]

Or directly reset GPUs whose memory was not freed

    nvidia-smi --gpu-reset -i [gpu_id]

Tensor Operations

Basic tensor information

    tensor.type()  # Data type
    tensor.size()  # Shape of the tensor. It is a subclass of Python tuple
    tensor.dim()   # Number of dimensions.

Data type conversion

    # Set default tensor type. Float in PyTorch is much faster than double.
    torch.set_default_tensor_type(torch.FloatTensor)
    # Type conversions.
    tensor = tensor.cuda()
    tensor = tensor.cpu()
    tensor = tensor.float()
    tensor = tensor.long()

Conversion between torch.Tensor and np.ndarray

    # torch.Tensor -> np.ndarray.
    ndarray = tensor.cpu().numpy()
    # np.ndarray -> torch.Tensor.
    tensor = torch.from_numpy(ndarray).float()
    tensor = torch.from_numpy(ndarray.copy()).float()  # If ndarray has negative stride

Conversion between torch.Tensor and PIL.Image

Image tensors in PyTorch use channel-first ordering by default (D×H×W for a single image, N×D×H×W for a batch) with values in [0, 1], so converting requires a permute and rescaling.

    # torch.Tensor -> PIL.Image.
    image = PIL.Image.fromarray(torch.clamp(tensor * 255, min=0, max=255
        ).byte().permute(1, 2, 0).cpu().numpy())
    image = torchvision.transforms.functional.to_pil_image(tensor)  # Equivalent way
    # PIL.Image -> torch.Tensor.
    tensor = torch.from_numpy(np.asarray(PIL.Image.open(path))
        ).permute(2, 0, 1).float() / 255
    tensor = torchvision.transforms.functional.to_tensor(PIL.Image.open(path))  # Equivalent way

Conversion between np.ndarray and PIL.Image

    # np.ndarray -> PIL.Image.
    image = PIL.Image.fromarray(ndarray.astype(np.uint8))
    # PIL.Image -> np.ndarray.
    ndarray = np.asarray(PIL.Image.open(path))

Extract the value from a single-element tensor

This is especially useful when accumulating the loss during training. Without it, the computation graph accumulates and GPU memory usage keeps growing.

    value = tensor.item()

Reshaping tensors

Reshaping is often needed to feed convolutional features into a fully connected layer. Compared with torch.view, torch.reshape automatically handles non-contiguous input tensors.

    tensor = torch.reshape(tensor, shape)
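A quick illustration of the difference, using a tensor made non-contiguous by a transpose:

```python
import torch

t = torch.arange(6).reshape(2, 3).t()  # Transpose makes the tensor non-contiguous.
assert not t.is_contiguous()

try:
    t.view(-1)                         # view requires contiguous memory and fails here.
    raise AssertionError('expected view to fail')
except RuntimeError:
    pass

flat = t.reshape(-1)                   # reshape copies when necessary.
assert flat.tolist() == [0, 3, 1, 4, 2, 5]
```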

Shuffling

    tensor = tensor[torch.randperm(tensor.size(0))]  # Shuffle the first dimension

Horizontal flip

PyTorch does not support negative-stride operations like tensor[::-1]; a horizontal flip can be implemented with tensor indexing.

    # Assume tensor has shape N*D*H*W.
    tensor = tensor[:, :, :, torch.arange(tensor.size(3) - 1, -1, -1).long()]

Copying tensors

There are three ways to copy a tensor, each suited to different needs.

    # Operation              | New/Shared memory | Still in computation graph |
    tensor.clone()           # New               | Yes                        |
    tensor.detach()          # Shared            | No                         |
    tensor.detach().clone()  # New               | No                         |
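The sharing behavior above can be verified directly: detach shares storage with the original tensor, while clone allocates new memory.

```python
import torch

t = torch.ones(3, requires_grad=True)
d = t.detach()           # Shared memory, removed from the computation graph.
c = t.detach().clone()   # New memory, removed from the computation graph.
assert not d.requires_grad and not c.requires_grad

d[0] = 5.0               # Writes through to t (shared storage).
assert t[0].item() == 5.0
c[1] = 7.0               # Does not affect t (separate storage).
assert t[1].item() == 1.0
```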

Concatenating tensors

Note the difference between torch.cat and torch.stack: torch.cat concatenates along an existing dimension, while torch.stack adds a new one. For example, given three 10×5 tensors, torch.cat yields a 30×5 tensor, while torch.stack yields a 3×10×5 tensor.

    tensor = torch.cat(list_of_tensors, dim=0)
    tensor = torch.stack(list_of_tensors, dim=0)
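The shape difference can be checked on small tensors:

```python
import torch

tensors = [torch.zeros(10, 5) for _ in range(3)]
assert torch.cat(tensors, dim=0).size() == (30, 5)       # Concatenates along dim 0.
assert torch.stack(tensors, dim=0).size() == (3, 10, 5)  # Adds a new leading dim.
```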

Convert integer labels to one-hot encoding

Labels in PyTorch start from 0 by default.

    N = tensor.size(0)
    one_hot = torch.zeros(N, num_classes).long()
    one_hot.scatter_(dim=1, index=torch.unsqueeze(tensor, dim=1),
                     src=torch.ones(N, num_classes).long())
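A small worked example of the scatter_-based encoding (using the value= overload, which is equivalent to a src tensor of ones here):

```python
import torch

labels = torch.tensor([2, 0, 1])
num_classes = 4
one_hot = torch.zeros(labels.size(0), num_classes).long()
one_hot.scatter_(1, labels.unsqueeze(1), 1)  # Write 1 at each label's column.
assert one_hot.tolist() == [[0, 0, 1, 0],
                            [1, 0, 0, 0],
                            [0, 1, 0, 0]]
```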

Find non-zero/zero elements

    torch.nonzero(tensor)               # Index of non-zero elements
    torch.nonzero(tensor == 0)          # Index of zero elements
    torch.nonzero(tensor).size(0)       # Number of non-zero elements
    torch.nonzero(tensor == 0).size(0)  # Number of zero elements

Expanding tensors

    # Expand tensor of shape 64*512 to shape 64*512*7*7.
    torch.reshape(tensor, (64, 512, 1, 1)).expand(64, 512, 7, 7)
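Note that expand allocates no new memory; the broadcast dimensions get stride 0, so every spatial position views the same underlying element:

```python
import torch

t = torch.arange(64 * 512, dtype=torch.float).reshape(64, 512)
e = torch.reshape(t, (64, 512, 1, 1)).expand(64, 512, 7, 7)
assert e.size() == (64, 512, 7, 7)
assert e.stride()[2:] == (0, 0)   # Expanded dims are stride-0 views.
assert e[3, 5, 2, 6] == t[3, 5]   # All 7*7 positions read the same value.
```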

Matrix multiplication

    # Matrix multiplication: (m*n) * (n*p) -> (m*p).
    result = torch.mm(tensor1, tensor2)
    # Batch matrix multiplication: (b*m*n) * (b*n*p) -> (b*m*p).
    result = torch.bmm(tensor1, tensor2)
    # Element-wise multiplication.
    result = tensor1 * tensor2

Compute pairwise Euclidean distances between two sets of vectors

    # X1 is of shape m*d.
    X1 = torch.unsqueeze(X1, dim=1).expand(m, n, d)
    # X2 is of shape n*d.
    X2 = torch.unsqueeze(X2, dim=0).expand(m, n, d)
    # dist is of shape m*n, where dist[i][j] = sqrt(|X1[i, :] - X2[j, :]|^2)
    dist = torch.sqrt(torch.sum((X1 - X2) ** 2, dim=2))
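A sanity check of the broadcasted formula against an explicit per-pair computation:

```python
import torch

m, n, d = 4, 3, 5
A = torch.randn(m, d)  # Plays the role of X1 before expansion.
B = torch.randn(n, d)  # Plays the role of X2 before expansion.

X1 = torch.unsqueeze(A, dim=1).expand(m, n, d)
X2 = torch.unsqueeze(B, dim=0).expand(m, n, d)
dist = torch.sqrt(torch.sum((X1 - X2) ** 2, dim=2))

# Compare with an explicit double loop.
for i in range(m):
    for j in range(n):
        assert torch.allclose(dist[i, j], torch.norm(A[i] - B[j]), atol=1e-5)
```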

Model Definition

Convolutional layers

The most common convolutional layer configurations are

    conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3,
                           stride=1, padding=1, bias=True)
    conv = torch.nn.Conv2d(in_channels, out_channels, kernel_size=1,
                           stride=1, padding=0, bias=True)

If a convolution configuration is complex and the output size is hard to work out by hand, the following visualization tool can help
Link: https://ezyang.github.io/convolution-visualizer/index.html

GAP (global average pooling) layer

    gap = torch.nn.AdaptiveAvgPool2d(output_size=1)

Bilinear pooling

    X = torch.reshape(X, (N, D, H * W))                   # Assume X has shape N*D*H*W
    X = torch.bmm(X, torch.transpose(X, 1, 2)) / (H * W)  # Bilinear pooling
    assert X.size() == (N, D, D)
    X = torch.reshape(X, (N, D * D))
    X = torch.sign(X) * torch.sqrt(torch.abs(X) + 1e-5)   # Signed-sqrt normalization
    X = torch.nn.functional.normalize(X)                  # L2 normalization

Multi-GPU synchronized BN (batch normalization)

When running on multiple GPUs with torch.nn.DataParallel, PyTorch's BN layers by default compute the mean and standard deviation independently on each card. Synchronized BN computes them using the data on all cards, which mitigates inaccurate estimates when the batch size is small, and is an effective trick for improving performance in tasks such as object detection.
Link: https://github.com/vacancy/Synchronized-BatchNorm-PyTorch

BN-style moving average

To implement a BN-style moving average, use an in-place operation to update the moving average in the forward function.

    class BN(torch.nn.Module):
        def __init__(self):
            ...
            self.register_buffer('running_mean', torch.zeros(num_features))

        def forward(self, X):
            ...
            self.running_mean += momentum * (current - self.running_mean)

Count the total number of model parameters

    num_parameters = sum(torch.numel(parameter) for parameter in model.parameters())
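As a check, a single torch.nn.Linear(10, 5) layer has 10*5 weights plus 5 biases:

```python
import torch

model = torch.nn.Linear(10, 5)
num_parameters = sum(torch.numel(p) for p in model.parameters())
assert num_parameters == 10 * 5 + 5  # 55
```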

Print model information like Keras's model.summary()

Link: https://github.com/sksq96/pytorch-summary

Model weight initialization

Note the difference between model.modules() and model.children(): model.modules() iterates recursively over all submodules of the model, while model.children() only iterates over the immediate children.

    # Common practice for initialization.
    for layer in model.modules():
        if isinstance(layer, torch.nn.Conv2d):
            torch.nn.init.kaiming_normal_(layer.weight, mode='fan_out',
                                          nonlinearity='relu')
            if layer.bias is not None:
                torch.nn.init.constant_(layer.bias, val=0.0)
        elif isinstance(layer, torch.nn.BatchNorm2d):
            torch.nn.init.constant_(layer.weight, val=1.0)
            torch.nn.init.constant_(layer.bias, val=0.0)
        elif isinstance(layer, torch.nn.Linear):
            torch.nn.init.xavier_normal_(layer.weight)
            if layer.bias is not None:
                torch.nn.init.constant_(layer.bias, val=0.0)

    # Initialization with given tensor.
    layer.weight = torch.nn.Parameter(tensor)

Load a pre-trained model for some layers

Note that if the saved model was wrapped in torch.nn.DataParallel, the current model needs to be wrapped as well.

    model.load_state_dict(torch.load('model.pth'), strict=False)

Load a GPU-saved model on the CPU

    model.load_state_dict(torch.load('model.pth', map_location='cpu'))

Data Preparation, Feature Extraction, and Fine-Tuning

Get basic information about a video

    import cv2
    video = cv2.VideoCapture(mp4_path)
    height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
    width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    num_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = int(video.get(cv2.CAP_PROP_FPS))
    video.release()

TSN: sample one frame per segment

    K = self._num_segments
    if is_train:
        if num_frames > K:
            # Random index for each segment.
            frame_indices = torch.randint(
                high=num_frames // K, size=(K,), dtype=torch.long)
            frame_indices += num_frames // K * torch.arange(K)
        else:
            frame_indices = torch.randint(
                high=num_frames, size=(K - num_frames,), dtype=torch.long)
            frame_indices = torch.sort(torch.cat((
                torch.arange(num_frames), frame_indices)))[0]
    else:
        if num_frames > K:
            # Middle index for each segment.
            frame_indices = num_frames // K // 2
            frame_indices += num_frames // K * torch.arange(K)
        else:
            frame_indices = torch.sort(torch.cat((
                torch.arange(num_frames), torch.arange(K - num_frames))))[0]
    assert frame_indices.size() == (K,)
    return [frame_indices[i] for i in range(K)]

Extract convolutional features of one layer from an ImageNet pre-trained model

    # VGG-16 relu5-3 feature.
    model = torchvision.models.vgg16(pretrained=True).features[:-1]
    # VGG-16 pool5 feature.
    model = torchvision.models.vgg16(pretrained=True).features
    # VGG-16 fc7 feature.
    model = torchvision.models.vgg16(pretrained=True)
    model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-3])
    # ResNet GAP feature.
    model = torchvision.models.resnet18(pretrained=True)
    model = torch.nn.Sequential(collections.OrderedDict(
        list(model.named_children())[:-1]))

    with torch.no_grad():
        model.eval()
        conv_representation = model(image)

Extract convolutional features of multiple layers from an ImageNet pre-trained model

    class FeatureExtractor(torch.nn.Module):
        """Helper class to extract several convolution features from the given
        pre-trained model.

        Attributes:
            _model, torch.nn.Module.
            _layers_to_extract, list<str> or set<str>

        Example:
            >>> model = torchvision.models.resnet152(pretrained=True)
            >>> model = torch.nn.Sequential(collections.OrderedDict(
                    list(model.named_children())[:-1]))
            >>> conv_representation = FeatureExtractor(
                    pretrained_model=model,
                    layers_to_extract={'layer1', 'layer2', 'layer3', 'layer4'})(image)
        """
        def __init__(self, pretrained_model, layers_to_extract):
            torch.nn.Module.__init__(self)
            self._model = pretrained_model
            self._model.eval()
            self._layers_to_extract = set(layers_to_extract)

        def forward(self, x):
            with torch.no_grad():
                conv_representation = []
                for name, layer in self._model.named_children():
                    x = layer(x)
                    if name in self._layers_to_extract:
                        conv_representation.append(x)
                return conv_representation

Other pre-trained models

Link: https://github.com/Cadene/pretrained-models.pytorch

Fine-tune the fully connected layer

    model = torchvision.models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = torch.nn.Linear(512, 100)  # Replace the last fc layer
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9,
                                weight_decay=1e-4)

Fine-tune the fully connected layer with a larger learning rate and the convolutional layers with a smaller one

    model = torchvision.models.resnet18(pretrained=True)
    finetuned_parameters = list(map(id, model.fc.parameters()))
    conv_parameters = (p for p in model.parameters() if id(p) not in finetuned_parameters)
    parameters = [{'params': conv_parameters, 'lr': 1e-3},
                  {'params': model.fc.parameters()}]
    optimizer = torch.optim.SGD(parameters, lr=1e-2, momentum=0.9, weight_decay=1e-4)

Model Training

Common preprocessing for training and validation data

The ToTensor operation converts a PIL.Image or an np.ndarray of shape H×W×D with values in [0, 255] into a torch.Tensor of shape D×H×W with values in [0.0, 1.0].

    train_transform = torchvision.transforms.Compose([
        torchvision.transforms.RandomResizedCrop(size=224, scale=(0.08, 1.0)),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=(0.485, 0.456, 0.406),
                                         std=(0.229, 0.224, 0.225)),
    ])
    val_transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(224),
        torchvision.transforms.CenterCrop(224),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=(0.485, 0.456, 0.406),
                                         std=(0.229, 0.224, 0.225)),
    ])

Basic training loop

    for t in range(80):
        for images, labels in tqdm.tqdm(train_loader, desc='Epoch %3d' % (t + 1)):
            images, labels = images.cuda(), labels.cuda()
            scores = model(images)
            loss = loss_function(scores, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Label smoothing

    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        N = labels.size(0)
        # C is the number of classes.
        smoothed_labels = torch.full(size=(N, C), fill_value=0.1 / (C - 1)).cuda()
        smoothed_labels.scatter_(dim=1, index=torch.unsqueeze(labels, dim=1), value=0.9)

        score = model(images)
        log_prob = torch.nn.functional.log_softmax(score, dim=1)
        loss = -torch.sum(log_prob * smoothed_labels) / N
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
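With fill value 0.1 / (C - 1) off the target and 0.9 on it, every smoothed row remains a valid probability distribution, which can be checked on the CPU:

```python
import torch

C = 10
labels = torch.tensor([3, 7])
N = labels.size(0)
smoothed = torch.full(size=(N, C), fill_value=0.1 / (C - 1))
smoothed.scatter_(1, labels.unsqueeze(1), 0.9)

assert torch.allclose(smoothed.sum(dim=1), torch.ones(N), atol=1e-6)  # Rows sum to 1.
assert abs(smoothed[0, 3].item() - 0.9) < 1e-6  # Target class keeps 0.9 of the mass.
```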

Mixup

    beta_distribution = torch.distributions.beta.Beta(alpha, alpha)
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()

        # Mixup images.
        lambda_ = beta_distribution.sample([]).item()
        index = torch.randperm(images.size(0)).cuda()
        mixed_images = lambda_ * images + (1 - lambda_) * images[index, :]

        # Mixup loss.
        scores = model(mixed_images)
        loss = (lambda_ * loss_function(scores, labels)
                + (1 - lambda_) * loss_function(scores, labels[index]))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

L1 regularization

    loss = ...  # Standard cross-entropy loss
    for param in model.parameters():
        loss += torch.sum(torch.abs(param))
    loss.backward()

Exempt bias terms from L2 regularization/weight decay

    bias_list = (param for name, param in model.named_parameters() if name[-4:] == 'bias')
    others_list = (param for name, param in model.named_parameters() if name[-4:] != 'bias')
    parameters = [{'params': bias_list, 'weight_decay': 0},
                  {'params': others_list}]
    optimizer = torch.optim.SGD(parameters, lr=1e-2, momentum=0.9, weight_decay=1e-4)

Gradient clipping

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20)

Compute accuracy from Softmax output

    score = model(images)
    prediction = torch.argmax(score, dim=1)
    num_correct = torch.sum(prediction == labels).item()
    accuracy = num_correct / labels.size(0)

Visualize the model's forward computation graph

Link: https://github.com/szagoruyko/pytorchviz

Visualize learning curves

Two options are Visdom, developed by Facebook, and TensorBoard (via tensorboardX).
https://github.com/facebookresearch/visdom
https://github.com/lanpa/tensorboardX

    # Example using Visdom.
    vis = visdom.Visdom(env='Learning curve', use_incoming_socket=False)
    assert vis.check_connection()
    vis.close()
    options = collections.namedtuple('Options', ['loss', 'acc', 'lr'])(
        loss={'xlabel': 'Epoch', 'ylabel': 'Loss', 'showlegend': True},
        acc={'xlabel': 'Epoch', 'ylabel': 'Accuracy', 'showlegend': True},
        lr={'xlabel': 'Epoch', 'ylabel': 'Learning rate', 'showlegend': True})

    for t in range(80):
        train(...)
        val(...)
        vis.line(X=torch.Tensor([t + 1]), Y=torch.Tensor([train_loss]),
                 name='train', win='Loss', update='append', opts=options.loss)
        vis.line(X=torch.Tensor([t + 1]), Y=torch.Tensor([val_loss]),
                 name='val', win='Loss', update='append', opts=options.loss)
        vis.line(X=torch.Tensor([t + 1]), Y=torch.Tensor([train_acc]),
                 name='train', win='Accuracy', update='append', opts=options.acc)
        vis.line(X=torch.Tensor([t + 1]), Y=torch.Tensor([val_acc]),
                 name='val', win='Accuracy', update='append', opts=options.acc)
        vis.line(X=torch.Tensor([t + 1]), Y=torch.Tensor([lr]),
                 win='Learning rate', update='append', opts=options.lr)

Get the current learning rate

    # If there is one global learning rate (which is the common case).
    lr = next(iter(optimizer.param_groups))['lr']

    # If there are multiple learning rates for different layers.
    all_lr = []
    for param_group in optimizer.param_groups:
        all_lr.append(param_group['lr'])

Learning rate decay

    # Reduce learning rate when validation accuracy plateaus.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max',
                                                           patience=5, verbose=True)
    for t in range(0, 80):
        train(...); val(...)
        scheduler.step(val_acc)

    # Cosine annealing learning rate.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

    # Reduce learning rate by 10 at given epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 70], gamma=0.1)
    for t in range(0, 80):
        scheduler.step()
        train(...); val(...)

    # Learning rate warmup by 10 epochs.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: t / 10)
    for t in range(0, 10):
        scheduler.step()
        train(...); val(...)
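The warmup schedule can be checked on a dummy optimizer; in recent PyTorch versions, with base lr 0.1 and lr_lambda=lambda t: t / 10, the rate ramps linearly (and is 0 at epoch 0):

```python
import torch

param = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: t / 10)

lrs = [optimizer.param_groups[0]['lr']]  # lr right after construction (epoch 0).
for _ in range(9):
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])

assert abs(lrs[0] - 0.00) < 1e-12  # 0.1 * (0 / 10)
assert abs(lrs[5] - 0.05) < 1e-12  # 0.1 * (5 / 10)
assert abs(lrs[9] - 0.09) < 1e-12  # 0.1 * (9 / 10)
```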

Save and load checkpoints

Note that to be able to resume training, you need to save the model state, the optimizer state, and the current epoch number together.

    # Save checkpoint.
    is_best = current_acc > best_acc
    best_acc = max(best_acc, current_acc)
    checkpoint = {
        'best_acc': best_acc,
        'epoch': t + 1,
        'model': model.state_dict(),
        'optimizer': optimizer.state_dict(),
    }
    model_path = os.path.join('model', 'checkpoint.pth.tar')
    torch.save(checkpoint, model_path)
    if is_best:
        shutil.copy(model_path, os.path.join('model', 'best_checkpoint.pth.tar'))

    # Load checkpoint.
    if resume:
        model_path = os.path.join('model', 'checkpoint.pth.tar')
        assert os.path.isfile(model_path)
        checkpoint = torch.load(model_path)
        best_acc = checkpoint['best_acc']
        start_epoch = checkpoint['epoch']
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print('Load checkpoint at epoch %d.' % start_epoch)

Compute accuracy, precision, and recall

    # data['label'] and data['prediction'] are groundtruth label and prediction
    # for each image, respectively.
    accuracy = np.mean(data['label'] == data['prediction']) * 100

    # Compute precision and recall for each class.
    for c in range(num_classes):
        tp = np.dot((data['label'] == c).astype(int),
                    (data['prediction'] == c).astype(int))
        tp_fp = np.sum(data['prediction'] == c)
        tp_fn = np.sum(data['label'] == c)
        precision = tp / tp_fp * 100
        recall = tp / tp_fn * 100
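A worked example on four predictions, small enough to check by hand:

```python
import numpy as np

label = np.array([0, 0, 1, 1])
prediction = np.array([0, 1, 1, 1])

accuracy = np.mean(label == prediction) * 100
assert accuracy == 75.0  # 3 of 4 predictions are correct.

# Class 1: tp = 2, predicted positives = 3, actual positives = 2.
tp = np.dot((label == 1).astype(int), (prediction == 1).astype(int))
precision = tp / np.sum(prediction == 1) * 100
recall = tp / np.sum(label == 1) * 100
assert abs(precision - 200 / 3) < 1e-9  # ~66.7%
assert recall == 100.0
```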

Other PyTorch Notes

Model definition

  • Define layers with parameters and pooling layers using the torch.nn module, and call activation functions directly from torch.nn.functional. The difference between torch.nn modules and torch.nn.functional is that torch.nn modules call torch.nn.functional under the hood, but a torch.nn module also holds the layer's parameters and can handle both the training and evaluation states of the network. When using torch.nn.functional, pay attention to the network state, e.g.

        def forward(self, x):
            ...
            x = torch.nn.functional.dropout(x, p=0.5, training=self.training)

  • Call model.train() or model.eval() before model(x) to switch the network state.

  • Wrap code blocks that do not need gradients in with torch.no_grad(). The difference between model.eval() and torch.no_grad() is that model.eval() switches the network to evaluation mode, where layers such as BN and dropout behave differently than during training, whereas torch.no_grad() disables PyTorch's autograd mechanism to reduce memory use and speed up computation; results computed under it cannot be used with loss.backward().
  • The input to torch.nn.CrossEntropyLoss does not need to go through Softmax. torch.nn.CrossEntropyLoss is equivalent to torch.nn.functional.log_softmax followed by torch.nn.NLLLoss.
  • Call optimizer.zero_grad() before loss.backward() to clear accumulated gradients. optimizer.zero_grad() and model.zero_grad() have the same effect.

PyTorch Performance and Debugging

  • In torch.utils.data.DataLoader, set pin_memory=True whenever possible; for very small datasets such as MNIST, pin_memory=False can actually be faster. The best value for num_workers needs to be found experimentally.

  • Use del to delete unneeded intermediate variables promptly and save GPU memory.
  • In-place operations save GPU memory, e.g.

        x = torch.nn.functional.relu(x, inplace=True)

  • Reduce data transfer between the CPU and GPU. For example, if you want to track the loss and accuracy of every mini-batch in an epoch, accumulating them on the GPU and transferring them back to the CPU once at the end of the epoch is faster than doing a GPU-to-CPU transfer for every mini-batch.

  • Half-precision floats via half() give some speedup, depending on the GPU model. Beware of stability issues caused by the reduced numerical precision.
  • Regularly use assert tensor.size() == (N, D, H, W) as a debugging aid to make sure tensor shapes match your expectations.
  • Apart from the labels y, avoid one-dimensional tensors; use n*1 two-dimensional tensors instead to avoid surprising results from one-dimensional tensor arithmetic.
  • Profile the time spent in each part of the code

        with torch.autograd.profiler.profile(enabled=True, use_cuda=False) as profile:
            ...
        print(profile)

    or run on the command line

        python -m torch.utils.bottleneck main.py