RuntimeError: CUDA error: an illegal memory access was encountered
Published 2020-08-20 · Updated 2020-11-11 · Categories: coding, pytorch
This error came up while training on the GPU, and for a long time I couldn't resolve it. It looked like it was related to the data size: training on a small dataset was fine, and the error only appeared after switching to a larger dataset. But according to what I found online, other people have hit the same error in completely different situations.
RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
Later I came across a discussion online where someone suggested running the training like this:
CUDA_LAUNCH_BLOCKING=1 python train.py
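The same flag can also be set from inside the script instead of on the command line; here is a minimal sketch, assuming it runs before PyTorch initializes CUDA:

import os

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback
# points at the operation that actually failed. It must be set before the
# first CUDA call, so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch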
I tried it, but after a few epochs the error came back. With CUDA_LAUNCH_BLOCKING=1 set, though, the reported error was much more detailed:
Traceback (most recent calls WITHOUT Sacred internals):
  File "train.py", line 98, in run
    model(dataloader)
  File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zcy/python_workspace/DSTex/model.py", line 117, in forward
    self.train_epoch(pbar_train, cur_epoch, train_batch_num, train_statistics_every)
  File "/home/zcy/python_workspace/DSTex/model.py", line 162, in train_epoch
    train_perf = self.train_batch(train_data, cur_epoch)
  File "/home/zcy/python_workspace/DSTex/model.py", line 215, in train_batch
    loss.backward()
  File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (copy_device_to_device at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/native/cuda/Copy.cu:61)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f5fc5c77b5e in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: at::native::copy_device_to_device(at::TensorIterator&, bool) + 0x861 (0x7f5fc82b12b1 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x240f91c (0x7f5fc82b391c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x9146ac (0x7f5fed76d6ac in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x911d73 (0x7f5fed76ad73 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x44 (0x7f5fed76c834 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::embedding_dense_backward_cuda(at::Tensor const&, at::Tensor const&, long, long, bool) + 0x4bd (0x7f5fc83fdbdd in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xde41dc (0x7f5fc6c881dc in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0xe2404c (0x7f5fedc7d04c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x28037f1 (0x7f5fef65c7f1 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xe2404c (0x7f5fedc7d04c in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::native::embedding_backward(at::Tensor const&, at::Tensor const&, long, long, bool, bool) + 0x124 (0x7f5fed7ca1a4 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0xeaefe0 (0x7f5fedd07fe0 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x29acffa (0x7f5fef805ffa in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0xee78d9 (0x7f5fedd408d9 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::generated::EmbeddingBackward::apply(std::vector >&&) + 0x1cd (0x7f5fef45ef9d in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2ae8215 (0x7f5fef941215 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f5fef93e513 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7f5fef93f2f2 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f5fef937969 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f5ff2c7e558 in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #21: <unknown function> + 0xc819d (0x7f5ff56e119d in /home/zcy/anaconda3/envs/nlp/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #22: <unknown function> + 0x76db (0x7f6017a0d6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x3f (0x7f601773688f in /lib/x86_64-linux-gnu/libc.so.6)
It looked like the error happened while copying a tensor. I remembered that someone had said a small code change fixed it for them, so I decided to try the code they provided:
torch.cuda.set_device(device)  # pass the GPU you are using, e.g. torch.cuda.set_device(0)
In short, call the line above right after you call tensor.cuda() or model.to(device). Update: this didn't work either; the error came back after a while.
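For reference, the ordering that was suggested looks roughly like this. A minimal sketch only: the model is a stand-in layer and cuda:0 is an assumed device.

import torch

device = torch.device("cuda:0")
model = torch.nn.Linear(10, 2).to(device)  # stand-in for the real model
x = torch.randn(4, 10).cuda()              # example input moved to the GPU

# The suggested workaround: pin the current device after the .cuda()/.to(device) calls
torch.cuda.set_device(device)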
Later I also tried setting torch.backends.cudnn.benchmark = False, which failed as well.
If nothing above works, you can try the assorted voodoo remedies people suggest in the related issue...
It seems to be a somewhat random error, and at that point I still had no solution...
Update: I think I've found the cause.
It turned out that when calling the cross-entropy loss, some ground-truth labels were larger than the dimension of the predicted probability distribution. For example, the distribution has 200 dimensions, but the corresponding label is 205. In my case I was using a pointer network to predict index positions within a sentence, and a data-processing mistake had added 1 to the ground-truth labels.
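To make the failure mode concrete, here is a minimal sketch with made-up sizes: nn.CrossEntropyLoss expects every target in [0, num_classes - 1], and a target like 205 against a 200-dimensional distribution indexes out of bounds on the GPU, which surfaces as the illegal memory access above. A cheap CPU-side check before the loss call catches it with a readable message instead:

import torch
import torch.nn as nn

num_classes = 200                          # dimension of the predicted distribution
logits = torch.randn(4, num_classes)       # fake batch of predictions
targets = torch.tensor([3, 57, 199, 205])  # 205 is out of range for 200 classes

# Validate targets before handing them to the loss / moving them to the GPU
assert targets.min().item() >= 0 and targets.max().item() < num_classes, (
    f"label out of range: max label {targets.max().item()}, "
    f"but only {num_classes} classes"
)

loss = nn.CrossEntropyLoss()(logits, targets)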
Note: this error can be caused by many different problems, so I can't guarantee my fix will work for you.
My attempts
It could also be an installation or version problem, e.g. the GPU build of PyTorch not being installed correctly.
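A quick way to check whether the installed build actually matches the machine; a minimal sketch that only prints what your own environment reports:

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the build was compiled against
print(torch.cuda.is_available())  # can the GPU build see a device?
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU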