I have recently been exploring applications of deep learning to communications, mainly with the PyTorch framework. This is my first serious use of it, so I inevitably ran into quite a few pitfalls, which I record here.

1. Error when loading data with DataLoader: OSError: [Errno 22] Invalid argument

  for i, data in enumerate(loader):

The error did not occur with small datasets; it only appeared once the data volume grew. My workaround was to change the DataLoader argument num_workers from 1 to 0, after which the program ran normally.
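A minimal sketch of the workaround; the Dataset, tensor shapes, and batch size below are placeholders, not the project's actual code:

  import torch
  from torch.utils.data import DataLoader, TensorDataset

  # placeholder dataset standing in for the real one
  dataset = TensorDataset(torch.randn(10000, 64), torch.randint(0, 4, (10000,)))

  # num_workers=0 loads batches in the main process instead of worker
  # processes, which avoided the OSError on the large dataset
  loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

  for i, data in enumerate(loader):
      inputs, labels = data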

2. Error when feeding a Tensor to the network: RuntimeError: Expected object of scalar type Double but got scalar type Float for argument #2 'mat2' in call to _th_mm

As the message suggests, this is a data type issue: the input data type was torch.float64; converting the input to torch.float32 solved it.
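A small example of the conversion, assuming x is the offending input tensor (for instance one built from a NumPy float64 array, a common cause):

  import numpy as np
  import torch

  x = torch.from_numpy(np.random.randn(8, 16))  # NumPy float64 -> torch.float64
  print(x.dtype)                                # torch.float64
  x = x.float()                                 # same as x.to(torch.float32)
  print(x.dtype)                                # torch.float32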

3. Error when calling view: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The error is caused by the tensor not being contiguous: targets.is_contiguous() returns False.
There are two fixes: 1) use reshape as the message suggests; 2) make the tensor contiguous first, then call view:

  targets.contiguous().view(targets.size(0) * targets.size(1), -1)
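The situation can be reproduced with a transposed (hence non-contiguous) tensor; the shapes below are only for illustration:

  import torch

  targets = torch.randn(4, 5, 6).transpose(0, 1)  # transpose makes it non-contiguous
  print(targets.is_contiguous())                  # False
  # targets.view(targets.size(0) * targets.size(1), -1)  # would raise the RuntimeError
  out = targets.reshape(targets.size(0) * targets.size(1), -1)            # fix 1
  out = targets.contiguous().view(targets.size(0) * targets.size(1), -1)  # fix 2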

4. Error when using the cross-entropy loss: Expected object of type torch.LongTensor but found type torch.FloatTensor for argument #2 'target'

Another data type mismatch; as the message suggests, convert the target to torch.long.
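For example, assuming the targets arrive as float labels:

  import torch
  import torch.nn as nn

  criterion = nn.CrossEntropyLoss()
  logits = torch.randn(8, 4)                                 # [batch, num_classes]
  targets = torch.tensor([0., 1., 3., 2., 1., 0., 2., 3.])   # float labels -> error
  loss = criterion(logits, targets.long())                   # cast to torch.long fixes it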

5. Another cross-entropy loss error: multi-target not supported at C:\w\1\s\windows\pytorch\aten\src\THNN/generic/ClassNLLCriterion.c:21

The reason is that torch.nn.CrossEntropyLoss() expects class-index targets rather than one-hot encodings; converting the targets to class indices fixes it.
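A small example, assuming the targets were originally one-hot encoded:

  import torch
  import torch.nn as nn

  criterion = nn.CrossEntropyLoss()
  logits = torch.randn(3, 4)                  # [batch, num_classes]
  one_hot = torch.tensor([[0, 1, 0, 0],
                          [0, 0, 0, 1],
                          [1, 0, 0, 0]])
  # criterion(logits, one_hot)                # -> "multi-target not supported"
  class_indices = one_hot.argmax(dim=1)       # tensor([1, 3, 0])
  loss = criterion(logits, class_indices)     # class indices work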

https://zhuanlan.zhihu.com/p/112099126

6. Segmentation fault (core dumped)

7. OSError: image file is truncated (16 bytes not processed)

  Traceback (most recent call last):
    File "train.py", line 131, in <module>
      for _, (input_images, ground_truths, masks) in enumerate(data_loader):
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
      data = self._next_data()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
      return self._process_data(data)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
      data.reraise()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 394, in reraise
      raise self.exc_type(msg)
  OSError: Caught OSError in DataLoader worker process 3.
  Original Traceback (most recent call last):
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
      data = fetcher.fetch(index)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
      data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
      data = [self.dataset[idx] for idx in possibly_batched_index]
    File "/home/guoxiefan/PyTorch/ImageInpainting/LBAM/src/dataset.py", line 76, in __getitem__
      ground_truth = self.image_files_transforms(image.convert('RGB'))
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/PIL/Image.py", line 873, in convert
      self.load()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/PIL/ImageFile.py", line 247, in load
      "(%d bytes not processed)" % len(b)
  OSError: image file is truncated (16 bytes not processed)

Solution: [Link]

It's probably that you installed Pillow instead of PIL. Try adding these to the datasets.py file at line 14:

  from PIL import ImageFile
  ImageFile.LOAD_TRUNCATED_IMAGES = True

8. RuntimeError: cuda runtime error (60) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1579022051443/work/aten/src/THC/THCGeneral.cpp:141

  Traceback (most recent call last):
    File "train.py", line 136, in <module>
      outputs = generator(input_images, masks)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
      result = self.forward(*input, **kwargs)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 148, in forward
      inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 159, in scatter
      return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
      inputs = scatter(inputs, target_gpus, dim) if inputs else []
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
      res = scatter_map(inputs)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
      return list(zip(*map(scatter_map, obj)))
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
      return Scatter.apply(target_gpus, None, dim, obj)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
      outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/cuda/comm.py", line 147, in scatter
      return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
  RuntimeError: cuda runtime error (60) : peer mapping resources exhausted at /opt/conda/conda-bld/pytorch_1579022051443/work/aten/src/THC/THCGeneral.cpp:141

Solution: for an nn.Module wrapped in nn.DataParallel, the arguments passed to forward should generally be plain scalars or raw batched tensors of shape [B, C, H, W]. The arguments interact with the parallelization and need special attention: nn.DataParallel splits the inputs along the batch dimension B when scattering them to the GPUs.
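A minimal sketch of this convention, assuming a toy generator; the module and shapes below are placeholders, not the project's actual model:

  import torch
  import torch.nn as nn

  class ToyGenerator(nn.Module):
      def __init__(self):
          super().__init__()
          self.conv = nn.Conv2d(4, 3, kernel_size=3, padding=1)

      def forward(self, images, masks):
          # both arguments are [B, C, H, W] tensors, so DataParallel can split
          # each of them along dim 0 and send one chunk to every GPU
          return self.conv(torch.cat([images, masks], dim=1))

  generator = nn.DataParallel(ToyGenerator().cuda())
  input_images = torch.randn(8, 3, 64, 64).cuda()  # B = 8 is split across GPUs
  masks = torch.randn(8, 1, 64, 64).cuda()
  outputs = generator(input_images, masks)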

9. RuntimeError: received 0 items of ancdata

  Traceback (most recent call last):
    File "train.py", line 123, in <module>
      for _, (input_images, ground_truths, masks) in enumerate(data_loader):
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
      data = self._next_data()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
      idx, data = self._get_data()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
      success, data = self._try_get_data()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
      data = self._data_queue.get(timeout=timeout)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py", line 113, in get
      return _ForkingPickler.loads(res)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
      fd = df.detach()
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
      return reduction.recv_handle(conn)
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
      return recvfds(s, 1)[0]
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
      len(ancdata))
  RuntimeError: received 0 items of ancdata

A quick workaround is to set num_workers=0 for data loading.
Looking deeper, the error occurs when the number of tensors shared between worker processes exceeds the open-files limit.
There are two proper fixes:

1. Raise the open-files limit. sudo ulimit -n does not work; instead run:

  sudo sh -c "ulimit -n 65535 && exec su $LOGNAME"

2. Switch the tensor sharing strategy for multiprocessing to file_system (the default, file_descriptor, is constrained by the open-files limit):

  torch.multiprocessing.set_sharing_strategy('file_system')
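A sketch of where the second fix would go, assuming it is placed at the top of the training script before any DataLoader is created:

  import torch.multiprocessing

  # back shared tensors with files instead of file descriptors,
  # so the open-files limit no longer applies
  torch.multiprocessing.set_sharing_strategy('file_system')

  # ... build the Dataset / DataLoader and start training as usual ...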

Solved via [Link] [Link]. A related shared-memory error also appeared:

  Traceback (most recent call last):
    File "train.py", line 124, in <module>
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py", line 113, in get
    File "/data/guoxiefan/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 310, in rebuild_storage_filename
  RuntimeError: unable to open shared memory object </torch_24388_2219814394> in read-write mode

Avoid repeatedly creating model objects inside a function that is called in the training loop (and then moving them to the GPU with .cuda()), for example a VGG model used to extract features. It is best to create the model once and pass it in as an argument.
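A minimal sketch of that pattern, assuming a hypothetical perceptual loss built on torchvision's VGG16; the loss itself is illustrative, not the project's actual code:

  import torch
  import torch.nn as nn
  from torchvision import models

  # create the feature extractor ONCE, outside the training loop
  vgg_features = models.vgg16(pretrained=True).features[:16].cuda().eval()
  for p in vgg_features.parameters():
      p.requires_grad = False

  def perceptual_loss(extractor, output, target):
      # the extractor is passed in as an argument instead of being rebuilt
      # (and re-.cuda()'d) on every call
      return nn.functional.l1_loss(extractor(output), extractor(target))

  # inside the training loop:
  # loss = perceptual_loss(vgg_features, outputs, ground_truths)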