Problem Description

In a recent job I was given a task: use Docker to build a GPU-enabled TensorFlow environment and package the container for others to use, plus install some related libraries I won't detail here. The client's demands left me somewhat speechless at times, and there were plenty of problems along the way, so I'm recording them here as a reference for the next time similar work comes up.

Environment

  • OS: CentOS Linux release 7.7.1908 (Core)
    • Command: cat /etc/redhat-release
  • Docker: Docker version 18.09.8, build 0dd43dd87f
    • Command: docker --version
  • GPU: NVIDIA GeForce GTX 1080 Ti

Setting Up the Environment

Some research showed that to use the GPU driver inside a Docker container, the first step is installing nvidia-docker, which describes itself as "Build and run Docker containers leveraging NVIDIA GPUs". The current generation is the second one, nvidia-docker2, and it is the default. No tutorial beats the official documentation: the nvidia-docker repository on GitHub. According to it, with current nvidia-docker you only need the GPU driver installed on the host; installing CUDA on the host is not required.

Installing the GPU Driver

The first option is to go to NVIDIA's official driver download page, pick the version matching your card, download the installer, and run it. I did not use that method; instead I installed the driver online from the ELRepo repository:

  # add the ELRepo repository
  rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
  rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
  # install and run nvidia-detect to find the recommended driver
  yum install nvidia-detect
  nvidia-detect -v
  # confirm the card is visible on the PCI bus
  lspci | grep -i NVIDIA
  # find and install the matching kernel module package
  yum search kmod-nvidia
  yum install kmod-nvidia.x86_64

This method is said to be less stable, but it is convenient.

Installing nvidia-docker

To install nvidia-docker, just follow the instructions in the README of the official repository:

  # get the distribution identifier
  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  # fetch the repo file with curl
  curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
  # install with yum
  sudo yum install -y nvidia-container-toolkit
  # restart docker
  sudo systemctl restart docker

Modifying the Docker Configuration

To use the NVIDIA runtime in containers, first edit Docker's configuration file:

  # edit the configuration file
  vim /etc/docker/daemon.json
  # write the following configuration
  {
      "default-runtime": "nvidia",
      "registry-mirrors": ["https://registry.docker-cn.com"],
      "runtimes": {
          "nvidia": {
              "path": "nvidia-container-runtime",
              "runtimeArgs": []
          }
      }
  }

This registers the nvidia runtime and makes it Docker's default; restart Docker afterwards for the change to take effect.
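A malformed daemon.json prevents the Docker daemon from starting, so it is worth validating the JSON before restarting. A minimal sketch using only the Python standard library (the configuration is inlined here for illustration; in practice read it from /etc/docker/daemon.json):

```python
import json

# the configuration written above, inlined for illustration; in practice
# read it from /etc/docker/daemon.json
raw = '''
{
    "default-runtime": "nvidia",
    "registry-mirrors": ["https://registry.docker-cn.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
'''

config = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
# check the two fields this guide relies on
assert config["default-runtime"] == "nvidia"
assert "nvidia" in config["runtimes"]
print("daemon.json OK")
```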

Using the GPU Inside a Container

First, pull an image. This was the first pitfall: an ordinary image will not do; you need an NVIDIA-enabled one. For TensorFlow, that means pulling a prebuilt TensorFlow GPU image from Docker Hub. This particular job had a special requirement: the base system had to be Ubuntu 16.04:

  docker pull tensorflow/tensorflow:2.1.0-custom-op-gpu-ubuntu16

In the 1.x days TensorFlow shipped separate CPU and GPU builds, but the recent releases are no longer split that way, so you can pull this tag directly. Then start a container with an ordinary docker run command:

  docker run -itd --name container-name -p 22000:22 tensorflow/tensorflow:2.1.0-custom-op-gpu-ubuntu16

Then enter the container:

  docker exec -it container-name bash

Inside the container, verify that the GPU is visible with the nvidia-smi command:

  root@14d4b1d94053:~# nvidia-smi
  Thu Mar 12 06:39:51 2020
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |===============================+======================+======================|
  |   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
  | 22%   40C    P0    53W / 250W |      0MiB / 11178MiB |      0%      Default |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                       GPU Memory |
  |  GPU       PID   Type   Process name                             Usage      |
  |=============================================================================|
  |  No running processes found                                                 |
  +-----------------------------------------------------------------------------+
  root@14d4b1d94053:~# nvcc -V
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2019 NVIDIA Corporation
  Built on Sun_Jul_28_19:07:16_PDT_2019
  Cuda compilation tools, release 10.1, V10.1.243

Next, verify that TensorFlow can actually use the GPU, following the check from TensorFlow's official page:

  root@14d4b1d94053:~# python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
  2020-03-12 06:42:45.846888: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
  2020-03-12 06:42:45.848669: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
  2020-03-12 06:42:51.977544: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
  2020-03-12 06:42:52.259782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.260348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
  pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
  coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
  2020-03-12 06:42:52.260423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
  2020-03-12 06:42:52.260475: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
  2020-03-12 06:42:52.381636: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
  2020-03-12 06:42:52.428450: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
  2020-03-12 06:42:52.437360: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
  2020-03-12 06:42:52.441722: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
  2020-03-12 06:42:52.441909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
  2020-03-12 06:42:52.442232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.443970: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.445548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
  2020-03-12 06:42:52.446925: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
  2020-03-12 06:42:52.587060: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3408000000 Hz
  2020-03-12 06:42:52.588038: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x618e180 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
  2020-03-12 06:42:52.588173: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
  2020-03-12 06:42:52.674842: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.675414: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6190c60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
  2020-03-12 06:42:52.675443: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
  2020-03-12 06:42:52.675688: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.676130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
  pciBusID: 0000:01:00.0 name: GeForce GTX 1080 Ti computeCapability: 6.1
  coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
  2020-03-12 06:42:52.676179: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
  2020-03-12 06:42:52.676196: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
  2020-03-12 06:42:52.676217: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
  2020-03-12 06:42:52.676237: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
  2020-03-12 06:42:52.676252: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
  2020-03-12 06:42:52.676273: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
  2020-03-12 06:42:52.676290: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
  2020-03-12 06:42:52.676347: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.676748: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.677095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
  2020-03-12 06:42:52.684803: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
  2020-03-12 06:42:52.896910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
  2020-03-12 06:42:52.896955: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
  2020-03-12 06:42:52.896983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
  2020-03-12 06:42:52.907562: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.909553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
  2020-03-12 06:42:52.910387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10435 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
  tf.Tensor(-186.68738, shape=(), dtype=float32)

This job also required installing some other libraries; PyTorch offers another quick way to verify GPU access:

  (base) root@aa09995aa171:/# python -c "import torch;print(torch.cuda.is_available())"
  True

At this point the environment is essentially ready. A few more preparations follow.

Optimizing the Container

For a Python developer, environment management is a basic requirement, and this job demanded it too. We use Anaconda for that; its installer script can be fetched with wget. Since we started from the TensorFlow image, basic tools such as wget and vim are already installed; if one is missing, install it first.

Switching Ubuntu to the Aliyun Mirror

  vim /etc/apt/sources.list
  # write the following configuration
  deb http://mirrors.aliyun.com/ubuntu/ xenial main
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial main
  deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main
  deb http://mirrors.aliyun.com/ubuntu/ xenial universe
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial universe
  deb http://mirrors.aliyun.com/ubuntu/ xenial-updates universe
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates universe
  deb http://mirrors.aliyun.com/ubuntu/ xenial-security main
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main
  deb http://mirrors.aliyun.com/ubuntu/ xenial-security universe
  deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security universe
  # update the package index
  apt-get update
  # install wget (if it is missing)
  apt-get install wget
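Editing with vim does not work in non-interactive setups such as a Dockerfile; a heredoc achieves the same result. A sketch, writing to ./sources.list for demonstration (the real target inside the container is /etc/apt/sources.list, and only three of the entries are shown):

```shell
# write a sources.list non-interactively (demo path; the real target is
# /etc/apt/sources.list inside the container)
cat > ./sources.list <<'EOF'
deb http://mirrors.aliyun.com/ubuntu/ xenial main
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main
EOF
# confirm the entries were written
grep -c '^deb ' ./sources.list
```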

Installing Anaconda

Downloading from Anaconda's official site with wget is very slow from inside China; download the installer package from the Tsinghua mirror instead, then run it:

  bash Anaconda3-2019.10-Linux-x86_64.sh

Accept the defaults; once the installation finishes, run:

  source ~/.bashrc

to enter the conda environment. Then switch Anaconda's channels to the Tsinghua mirror, otherwise creating environments is very slow:

  # any conda config command will create ~/.condarc if it does not exist
  conda config --set show_channel_urls yes
  # edit the configuration
  vim ~/.condarc
  # write the following configuration
  channels:
    - defaults
  show_channel_urls: true
  channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
  default_channels:
    - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
    - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
    - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
    - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
    - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
  custom_channels:
    conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
If you have no heavy scientific-computing needs, installing Miniconda is enough.

Changing the pip Mirror

Installing third-party packages with pip can be very slow; switch the pip index to the Aliyun mirror:

  mkdir ~/.pip
  vim ~/.pip/pip.conf
  # write the following configuration
  [global]
  index-url = https://mirrors.aliyun.com/pypi/simple/
  [install]
  trusted-host = mirrors.aliyun.com
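pip.conf is a plain INI file, so a quick way to catch syntax mistakes is to parse it with Python's standard configparser. A sketch with the configuration inlined (in practice read it from ~/.pip/pip.conf):

```python
import configparser

# the pip.conf written above, inlined for illustration
raw = """
[global]
index-url = https://mirrors.aliyun.com/pypi/simple/

[install]
trusted-host = mirrors.aliyun.com
"""

cfg = configparser.ConfigParser()
cfg.read_string(raw)  # raises configparser.Error on bad syntax
assert cfg["global"]["index-url"] == "https://mirrors.aliyun.com/pypi/simple/"
print("pip.conf OK:", cfg["install"]["trusted-host"])
```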

Exporting the Container

Honestly, this was the biggest source of problems in the whole job. Docker offers two export/import mechanisms: docker export with docker import, and docker save with docker load. Both docker import and docker load produce an image, from which you then run a new container. The differences deserve an article of their own; in short, docker export takes a snapshot of the container's filesystem and does not preserve its metadata. If you go that route, a container exported and re-imported even on the same machine breaks: the GPU becomes unusable. This cost me a long time. So, to hand the environment to someone else, use docker save. Since docker save operates on images, first commit the prepared container to an image:

  docker commit container-name image-name:version

Then export it with docker save:

  docker save -o image-name.tar image-name:version

And that's it. For details on the difference between docker export and docker save, see the article "docker export vs docker save".
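To make the hand-off concrete, the full round-trip looks like this (container-name, image-name, and version are the placeholders used above; the load step runs on the recipient's machine):

```shell
# on the build machine: freeze the container into an image and archive it
docker commit container-name image-name:version
docker save -o image-name.tar image-name:version

# on the recipient's machine: restore the image with its metadata intact
docker load -i image-name.tar
docker run -itd --name container-name image-name:version

# for contrast, the export path flattens the filesystem and drops metadata
# (ENV, ENTRYPOINT, CMD, layer history), which is what breaks GPU access:
# docker export container-name -o rootfs.tar
# docker import rootfs.tar flat-image:version
```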