PyTorch - [pytorch] Which processes are occupying a GPU, and why - 《Basic Summary》

Finding the processes that occupy the GPU
Debugging the code
- Cause
- Solutions

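None of the devices I specified when training models with PyTorch was cuda:0, yet memory usage on cuda:0 kept growing as well. To investigate, the first step is to find out which processes are occupying GPU 0.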

Finding the processes that occupy the GPU

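# Install fuser first if it is missing (on Debian/Ubuntu it is provided by the psmisc package)
apt-get install psmisc
# List every process that has an NVIDIA device file open
fuser -v /dev/nvidia*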
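As a complement to fuser, the same question can be asked of nvidia-smi, which also reports how much memory each process holds. The following is a minimal sketch rather than a definitive tool: it assumes nvidia-smi is on the PATH, and the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this example.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "gpu_uuid,pid,process_name,used_memory"


def parse_compute_apps(csv_text):
    """Parse the output of
    nvidia-smi --query-compute-apps=<QUERY> --format=csv,noheader,nounits
    into a list of dicts, one per (GPU, process) pair."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        gpu_uuid, pid, name, used = (f.strip() for f in fields)
        # used_memory is kept as a string: the driver may report "[N/A]".
        rows.append({"gpu_uuid": gpu_uuid, "pid": int(pid),
                     "name": name, "used_mib": used})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-compute-apps={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `list_gpu_processes()` on the training machine and looking for one PID that appears under two different GPU UUIDs is a quick way to spot a process that touches a card it was not supposed to use.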

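One possibility is that a killed process left remnants behind. In that case, compare against top and kill the processes that fuser lists but top does not show. The processes can also be cleaned up in bulk. Note that the command below kills every process that has any /dev/nvidia* device open; narrow the glob (for example to /dev/nvidia0) to target a single card.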

sudo fuser -v /dev/nvidia* |awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sudo sh

In my case it was not a leftover process. I found that a training process assigned to cuda:1 was also using memory on cuda:0, so I went on to debug the code.

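Debugging the code

I checked the code: both the model and the data were already moved with to(device) to the device I had specified. Searching around turned up some clues.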

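Cause

According to PyTorch issue #3477, "cuda out of memory error when GPU0 memory is fully utilized", PyTorch performs some initialization on GPU 0. Operations such as CPU-GPU copies and .tolist() need to be wrapped in torch.cuda.device_of(tensor); otherwise they run under the current device, which is GPU 0 by default.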
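To make the wrapping mentioned in the issue concrete, here is a small sketch. `torch.cuda.device_of` is a real PyTorch context manager; the helper name `tolist_on_own_device` is invented for this example, and whether the wrapping is still needed depends on the PyTorch version, since newer releases may handle it internally.

```python
import torch


def tolist_on_own_device(tensor):
    """Call .tolist() with the current device switched to the tensor's device."""
    # device_of is a no-op for CPU tensors; for CUDA tensors it makes the
    # tensor's GPU the current device for the duration of the block, so the
    # copy back to the host does not go through the default device (GPU 0).
    with torch.cuda.device_of(tensor):
        return tensor.tolist()
```

The same pattern applies to any operation that implicitly uses the current device rather than the device of its argument.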

Solutions

There are three ways to avoid this behavior.

with torch.cuda.device(idx)

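import torch
import torch.nn as nn

seq_len, batch_size, features = 10, 4, 8  # placeholder sizes
model = nn.Linear(features, 1)  # placeholder model

with torch.cuda.device(5):  # inside this block the current device is GPU 5
    model.cuda()
    X_train = torch.randn(seq_len, batch_size, features).cuda()
    y_train = torch.randn(batch_size).cuda()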

CUDA_VISIBLE_DEVICES

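Set the variable when launching the program from the terminal. The process then sees only the listed GPUs, renumbered starting from cuda:0.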

CUDA_VISIBLE_DEVICES=0 python train.py
CUDA_VISIBLE_DEVICES=0,1 python train.py
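When training runs are started from another Python script rather than typed into a shell, the same restriction can be applied through the child's environment. This is a sketch under that assumption; `gpu_env` and `run_on_gpus` are hypothetical helper names, and train.py stands for whatever script is being launched.

```python
import os
import subprocess
import sys


def gpu_env(gpu_ids):
    """Return a copy of the current environment restricted to the given GPUs."""
    env = os.environ.copy()  # the parent's own environment is left untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def run_on_gpus(script, gpu_ids):
    """Run `python <script>` so that it can only see the listed GPUs."""
    return subprocess.run([sys.executable, script],
                          env=gpu_env(gpu_ids), check=True)
```

For example, `run_on_gpus("train.py", [1])` is the counterpart of `CUDA_VISIBLE_DEVICES=1 python train.py`.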

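Setting the visible GPUs in code

One caveat: the variable must be set before import torch, because it only takes effect if it is in place before CUDA is initialized.
This is done as follows: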

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'
import torch
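One consequence of masking is easy to miss: inside the process the visible GPUs are renumbered from 0, so with the setting '1,2' the name cuda:0 refers to physical GPU 1. The sketch below spells out that mapping. It assumes the variable holds comma-separated integer ids (it may also hold GPU UUIDs, which this sketch does not handle), and `physical_gpu_index` is a name invented for this example.

```python
import os


def physical_gpu_index(logical_index, visible=None):
    """Map an in-process index (the N in cuda:N) to the physical GPU id.

    `visible` defaults to the current value of CUDA_VISIBLE_DEVICES.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return logical_index  # no mask set: indices are unchanged
    ids = [int(part) for part in visible.split(",") if part.strip()]
    return ids[logical_index]  # IndexError if the index is not visible
```

With `visible="1,2"`, index 0 maps to GPU 1 and index 1 maps to GPU 2, which is why a script that hard-codes cuda:1 ends up on physical GPU 2 under this setting.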

Some references:
https://blog.csdn.net/qq_35037684/article/details/106649307?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param
https://blog.csdn.net/weixin_38376691/article/details/96435895?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.nonecase