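TensorFlow Training Pitfalls (work in progress)

Training on GPU: Controlling GPU Memory
- (2) Restrict the GPUs available to the current program
  - Method 1: Set the GPUs visible to the program with tf.config.experimental.set_visible_devices
  - Method 2: Use the CUDA_VISIBLE_DEVICES environment variable
- (3) Control memory usage on each individual GPU
  - Method 1: Allocate memory on demand with tf.config.experimental.set_memory_growth
  - Method 2: Cap memory usage directly (by configuring a virtual GPU device with tf.config.experimental.set_virtual_device_configuration)
Supplement: the official GPU usage guide (v2.4.0)
References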

By @nikkyoya(u2331987), 2021.1.15

Training on GPU: Controlling GPU Memory

**Background** While training a classification model with tf.keras, training failed with the error `Blas GEMM launch failed`:

```powershell
InternalError:  Blas GEMM launch failed : a.shape=(32, 784), b.shape=(784, 300), m=32, n=300, k=784
	 [[node sequential/dense/MatMul (defined at :1) ]] [Op:__inference_train_function_494]

Function call stack:
train_function
```


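**Cause of the error**
1. Other Python programs are already occupying GPU resources, so the current program cannot allocate enough of them to run.
2. By default, tensorflow-gpu claims nearly all the memory of all GPUs visible to the process (to reduce memory fragmentation and make more efficient use of the relatively scarce GPU memory on the device).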
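
To check the first cause, it helps to see how much memory each GPU already has in use before training starts. The sketch below is one way to do that, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH. The helper names `parse_gpu_memory` and `query_gpu_memory` are made up for this example.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' CSV rows (values in MiB) into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage; return [] if it is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory(result.stdout)


for gpu in query_gpu_memory():
    print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB in use")
```

If a GPU is already close to full here, another process is holding its memory, and the methods in the following sections are the way to share the card.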
### Solutions
#### (1) List the compute devices of a given type on the current host
```python
import tensorflow as tf

# List all GPUs
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
# List all CPUs
cpus = tf.config.experimental.list_physical_devices(device_type='CPU')
print(gpus, cpus)
```

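#### (2) Restrict the GPUs available to the current program

Method 1: Set the GPUs visible to the program with `tf.config.experimental.set_visible_devices`

Once this is set, the program uses only the devices visible to it; devices that are not visible are never used by it.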

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only use the first GPU
  try:
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized
    print(e)

Method 2: Use the `CUDA_VISIBLE_DEVICES` environment variable

In a terminal, run:

export CUDA_VISIBLE_DEVICES=2,3

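Or add the following to your code, before TensorFlow initializes the GPUs (the safest place is before `import tensorflow`):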

import os
os.environ['CUDA_VISIBLE_DEVICES'] = "2,3"
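
One detail that is easy to trip over: CUDA renumbers the visible devices starting from 0, so with `CUDA_VISIBLE_DEVICES=2,3` the program sees physical GPU 2 as `GPU:0` and physical GPU 3 as `GPU:1`. The sketch below illustrates that mapping, assuming the variable holds plain integer indices (it can also hold GPU UUIDs, which this sketch does not handle). `visible_gpu_mapping` is a hypothetical helper name.

```python
import os

# Set this before TensorFlow initializes CUDA; the safest place is before
# `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"


def visible_gpu_mapping(value):
    """Map in-program GPU index -> physical GPU index for a value like "2,3"."""
    physical_ids = [int(part) for part in value.split(",") if part.strip()]
    return dict(enumerate(physical_ids))


mapping = visible_gpu_mapping(os.environ["CUDA_VISIBLE_DEVICES"])
for logical, physical in mapping.items():
    print(f"GPU:{logical} inside the program is physical GPU {physical}")
```

This matters when combining both methods: after the environment variable is applied, `gpus[0]` in the TensorFlow code refers to the first visible GPU, not to physical GPU 0.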

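#### (3) Control memory usage on each individual GPU

Method 1: Allocate memory on demand with `tf.config.experimental.set_memory_growth`

Memory is requested only when it is needed: the program uses very little GPU memory at startup and allocates more as it runs.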

Single GPU:

# Enable on-demand allocation
tf.config.experimental.set_memory_growth(gpus[0], True)

Multiple GPUs:

# Enable on-demand allocation
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
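
The official GPU guide also describes an environment variable, `TF_FORCE_GPU_ALLOW_GROWTH`, as another way to turn on the same option without touching the training code. A minimal sketch, assuming the variable is set before TensorFlow initializes the GPUs (the guide notes that this setting is platform specific):

```python
import os

# Another way to enable on-demand allocation. It has to be in place before
# TensorFlow initializes the GPUs; the safest place is before `import tensorflow`.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Import TensorFlow only after this point, for example:
# import tensorflow as tf
print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The same variable can be exported in the shell before launching the script, which is handy when the code itself cannot be modified.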

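Method 2: Cap memory usage directly (by configuring a virtual GPU device with `tf.config.experimental.set_virtual_device_configuration`)

The program cannot go beyond the configured memory limit; if it tries to, an error is raised.

# Apply the limit to the GPU that needs it (memory_limit is in MB)
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=8192)])

This method can also simulate multiple GPUs on a single-GPU machine for debugging.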

tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
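
When splitting one physical GPU into several virtual ones, the per-device limits have to fit into the card's memory with some room to spare. The sketch below works out equal `memory_limit` values (in MB) and applies them. `split_memory_limits`, `configure_virtual_gpus`, and the 1024 MB headroom default are assumptions made for illustration, and TensorFlow is imported inside the second function only so that the arithmetic can be tried without it installed.

```python
def split_memory_limits(total_mb, n_devices, headroom_mb=1024):
    """Return n_devices equal memory limits (MB) that leave headroom_mb unused."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    usable_mb = total_mb - headroom_mb
    if usable_mb < n_devices:
        raise ValueError("not enough memory to split")
    return [usable_mb // n_devices] * n_devices


def configure_virtual_gpus(limits_mb):
    """Split the first physical GPU into len(limits_mb) virtual GPUs."""
    import tensorflow as tf  # local import: only needed when actually configuring

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if not gpus:
        return []
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=mb)
             for mb in limits_mb])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
    return tf.config.experimental.list_logical_devices('GPU')


# An 8192 MB card split into two virtual GPUs, leaving 1024 MB free
print(split_memory_limits(8192, 2))
```

Calling `configure_virtual_gpus(split_memory_limits(8192, 2))` at the very start of a program would then be expected to report two logical GPUs backed by the one physical card.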

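Supplement: the official GPU usage guide (v2.4.0)

Logging which device each operation and tensor is assigned to

Put `tf.debugging.set_log_device_placement(True)` as the first statement of the program: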

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Manually choosing the device an operation runs on

tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

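Using a single GPU on a multi-GPU system

If the system has more than one GPU, the GPU with the lowest ID is selected by default.

To run on a different GPU, specify it explicitly. If the specified device does not exist, a `RuntimeError` is raised, which the example below triggers on purpose: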

tf.debugging.set_log_device_placement(True)
try:
  # Specify an invalid GPU device
  with tf.device('/device:GPU:2'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
except RuntimeError as e:
  print(e)

Using multiple GPUs (`tf.distribute.MirroredStrategy`)

This program runs one copy of the model on each GPU and splits the input data across the GPUs, which is known as "data parallelism".

tf.debugging.set_log_device_placement(True)
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  inputs = tf.keras.layers.Input(shape=(1,))
  predictions = tf.keras.layers.Dense(1)(inputs)
  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
  model.compile(loss='mse',
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
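
To make "data parallelism" concrete, here is a framework-free sketch of the idea behind it: each replica computes the gradient on its own shard of the batch, and the average of those gradients matches the gradient over the full batch when the shards are equal in size. This only illustrates the principle for a one-weight linear model; it is not how `MirroredStrategy` is implemented, and all names in it are made up for the example.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(items, n_replicas):
    """Split a batch into n_replicas equal, contiguous shards."""
    size = len(items) // n_replicas
    return [items[i * size:(i + 1) * size] for i in range(n_replicas)]


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Each "GPU" computes the gradient on its own shard of the inputs
replica_grads = [
    mse_gradient(w, x_shard, y_shard)
    for x_shard, y_shard in zip(shard(xs, 2), shard(ys, 2))
]

# Averaging the per-replica gradients gives the same value as the
# gradient computed over the whole batch on a single device
averaged = sum(replica_grads) / len(replica_grads)
full = mse_gradient(w, xs, ys)
print(averaged, full)
```

Because every replica ends up applying the same combined update, the model copies stay identical from step to step.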

Replicating manually by building the model on each GPU

tf.debugging.set_log_device_placement(True)
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  # Replicate your computation on multiple GPUs
  c = []
  for gpu in gpus:
    with tf.device(gpu.name):
      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
      c.append(tf.matmul(a, b))
  with tf.device('/CPU:0'):
    matmul_sum = tf.add_n(c)
  print(matmul_sum)

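References

Official documentation: TensorFlow Core, "Use a GPU"
【Tensorflow】Tensorflow 2.0 GPU的使用与分配 (Using and allocating GPUs in TensorFlow 2.0)
tensorflow2.0显存设置 (GPU memory settings in tensorflow2.0)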