TensorFlow's BFC Memory Management, Starting from a GPU OOM


A GPU training job on our platform failed with a CUDA OOM. The error message:

E  Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11711807488

The session had no GPU-related settings at all. TensorFlow's suggestion is to control GPU memory allocation with these parameters:

tf_config.gpu_options.allow_growth = True
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.1

The per_process_gpu_memory_fraction parameter

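per_process_gpu_memory_fraction is a factor that controls how much GPU memory a single process may take. It acts as a threshold: it sets the share of GPU memory the process claims, and therefore how much GPU memory is left for the system. If it is not set and enough memory is available, TensorFlow leaves only a small reserve for the system: 225M when available memory is below 2G, otherwise 0.05 of available memory but at least 300M. This is a greedy way of claiming memory. Setting the factor lets you limit the process to the amount you actually need.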

  double config_memory_fraction =
      options.config.gpu_options().per_process_gpu_memory_fraction();
  if (config_memory_fraction == 0) {
    allocated_memory = available_memory;
    const int64 min_system_memory = MinSystemMemory(available_memory);
    if (min_system_memory < allocated_memory) {
      allocated_memory -= min_system_memory;
    }
  } else {
    allocated_memory = total_memory * config_memory_fraction;
  }
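The reservation rule can be modeled in a few lines of Python. This is only a sketch of the logic described here, not TensorFlow's code: the function names are hypothetical, and the thresholds (225M below 2G, otherwise the larger of 300M and 5%) are taken from the description, assuming binary units (MiB, GiB).

```python
MIB = 1024 * 1024


def min_system_memory(available_memory: int) -> int:
    """Bytes left for the system, following the heuristic described in the text."""
    if available_memory < 2 * 1024 * MIB:
        return 225 * MIB
    return max(300 * MIB, int(available_memory * 0.05))


def allocated_memory(total_memory: int, available_memory: int, fraction: float) -> int:
    """Bytes the process may use; fraction == 0 means the option was not set."""
    if fraction == 0:
        allocated = available_memory
        reserve = min_system_memory(available_memory)
        if reserve < allocated:
            allocated -= reserve
        return allocated
    # Note: the fraction is applied to total memory, not to free memory.
    return int(total_memory * fraction)


total = 11711807488  # "total memory reported" in the error message
print(allocated_memory(total, total, 0))    # greedy default
print(allocated_memory(total, total, 0.1))  # capped at 10% of total
```

With the fraction unset, almost the whole device is claimed; with 0.1 the cap is computed from total memory regardless of how much is currently free.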

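If you run on a platform where much of the memory is already in use, the free memory on each GPU is not necessarily the same. The factor is applied to total memory, not to what is currently free, so a single factor cannot control the allocation of each process, and a process can fail because there is not enough memory left for it.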

The allow_growth parameter

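This parameter may look odd at first. allow_growth literally means "allow growth", so does it mean more memory can be allocated later on? In fact, TensorFlow does not really request memory when it starts. The initial parameters only govern how much memory may actually be requested and used later.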

On top of this, TensorFlow has a virtual memory management layer: BFC.

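BFC is a virtual memory allocator, similar to a simplified version of Doug Lea's malloc (dlmalloc). It implements the "best-fit with coalescing" algorithm, reducing fragmentation by merging free blocks, and every memory allocation is required to go through its interface.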

1 The Chunk struct

1.1 The struct

The chunk is TensorFlow's smallest unit of memory: a contiguous block whose size is a multiple of 256 bytes (kMinAllocationSize). TensorFlow's memory management is built on chunks.
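Because every chunk is a multiple of 256 bytes, a request has to be rounded up before a chunk can serve it. A minimal Python sketch of that rounding follows; the function name is a hypothetical stand-in and only the 256-byte unit comes from the text.

```python
K_MIN_ALLOCATION_SIZE = 256  # kMinAllocationSize from the text


def rounded_bytes(num_bytes: int) -> int:
    """Round a request up to the next multiple of 256 bytes."""
    units = (num_bytes + K_MIN_ALLOCATION_SIZE - 1) // K_MIN_ALLOCATION_SIZE
    return units * K_MIN_ALLOCATION_SIZE


for request in (1, 256, 257, 1000):
    print(request, "->", rounded_bytes(request))
```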

[Figure 1]

1.1.1 chunkhandle

A ChunkHandle is an index into the chunk vector. TensorFlow keeps all chunks in a single vector, and a chunk's position in that vector is its ChunkHandle.

    // If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
    // preceding the memory used by this chunk.  E.g., It should start
    ChunkHandle prev = kInvalidChunkHandle;
    // If not kInvalidChunkHandle, the memory referred to by 'next' is directly
    // following the memory used by this chunk.  E.g., It should be at
    ChunkHandle next = kInvalidChunkHandle;

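The Chunk struct holds two ChunkHandles, prev and next (both indexes into the chunk vector). They refer to the chunks that occupy the memory immediately before and immediately after this one.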

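1.1.2 The ptr pointer

ptr is a memory pointer to the start of the chunk's memory. Because a chunk refers to contiguous memory, the only other thing it needs to record is its size.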

1.2 Allocating a chunk

  std::vector<Chunk> chunks_ TF_GUARDED_BY(lock_);
  // Pointer to head of linked list of free Chunks
  ChunkHandle free_chunks_list_ TF_GUARDED_BY(lock_);

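TensorFlow keeps a vector of all chunks, chunks_, which must be contiguous in memory.
To avoid allocating and freeing chunks frequently, freed chunks are reused. To find freed chunks quickly, TensorFlow also maintains a linked list of freed chunks; free_chunks_list_ points to the head of that list.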

[Figure 2]

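free_chunks_list_ works as a stack-like cache of reusable chunks. Each time a chunk is allocated, TensorFlow first tries to pop one from this cache, and creates a new one only if the cache is empty.
When a chunk is released it is not destroyed immediately; it is added to the cache.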


BFCAllocator::ChunkHandle BFCAllocator::AllocateChunk() {
  if (free_chunks_list_ != kInvalidChunkHandle) {
    ChunkHandle h = free_chunks_list_;
    Chunk* c = ChunkFromHandle(h);
    free_chunks_list_ = c->next;
    return h;
  } else {
    ChunkHandle h = chunks_.size();
    chunks_.resize(h + 1);
    return h;
  }
}
void BFCAllocator::DeallocateChunk(ChunkHandle h) {
  Chunk* c = ChunkFromHandle(h);
  c->allocation_id = -1;
  c->bin_num = kInvalidBinNum;
  c->next = free_chunks_list_;
  free_chunks_list_ = h;
}

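1.3 Deleting a chunk

Chunks are reused, so deleting a chunk means erasing its properties, such as ptr. This does not free the memory that ptr points to. Instead, the ChunkHandle recorded in the Region for that address is set to invalid, and the chunk is added to the head of the freed-chunk list, so free_chunks_list_ now points to the chunk that was just released.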
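The handle-reuse behavior of AllocateChunk and DeallocateChunk can be tried out with a small Python model. It is a sketch under simplifying assumptions: the class and field names are hypothetical, INVALID stands in for kInvalidChunkHandle, and the Region bookkeeping is left out.

```python
from dataclasses import dataclass
from typing import List, Optional

INVALID = -1  # stands in for kInvalidChunkHandle


@dataclass
class Chunk:
    ptr: Optional[int] = None
    size: int = 0
    next: int = INVALID  # doubles as the free-list link once released


class ChunkPool:
    def __init__(self) -> None:
        self.chunks: List[Chunk] = []   # handle == index into this list
        self.free_head: int = INVALID   # head of the freed-chunk list

    def allocate_chunk(self) -> int:
        if self.free_head != INVALID:
            # Pop a recycled handle from the free list.
            h = self.free_head
            self.free_head = self.chunks[h].next
            return h
        # Free list is empty: grow the vector by one chunk.
        self.chunks.append(Chunk())
        return len(self.chunks) - 1

    def deallocate_chunk(self, h: int) -> None:
        c = self.chunks[h]
        c.ptr = None                # erase the chunk's properties
        c.size = 0
        c.next = self.free_head     # push onto the head of the free list
        self.free_head = h


pool = ChunkPool()
a = pool.allocate_chunk()
b = pool.allocate_chunk()
pool.deallocate_chunk(a)
c = pool.allocate_chunk()  # gets the handle that was just released
print(a, b, c, len(pool.chunks))
```

The vector never shrinks in this model; releasing a chunk only makes its handle available again.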

2 Region

A Region is a contiguous block of memory that has already been allocated. A Region can be split into multiple chunks, and each chunk covers several contiguous 256-byte blocks.

[Figure 3]

2.1 Allocating a Region

A Region is only allocated when memory is actually needed.

size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
void* mem_addr = suballocator_->Alloc(32, bytes);

As the code above shows, the amount of memory requested for each Region is controlled by the following parameters:

The curr_region_allocation_bytes_ parameter

    if (allow_growth) {
      // 1MiB smallest initial allocation, unless total memory available
      curr_region_allocation_bytes_ =
          RoundedBytes(std::min(total_memory, size_t{1048576}));
    } else {
      curr_region_allocation_bytes_ = RoundedBytes(total_memory);
    }

The allow_growth parameter here is the one set earlier:

tf_config.gpu_options.allow_growth = True

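When allow_growth is off, curr_region_allocation_bytes_ defaults to the size of all remaining memory, so there is only one Region.

When allow_growth is on, there are multiple Regions. curr_region_allocation_bytes_ starts at a minimum of 1M and doubles by default, so each Region request is at least twice the size of the previous one.

If the memory actually needed is larger than curr_region_allocation_bytes_, the value is doubled repeatedly until it is large enough for the request.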

bool increased_allocation = false;
while (rounded_bytes > curr_region_allocation_bytes_) {
    curr_region_allocation_bytes_ *= 2;
    increased_allocation = true;
}
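To see how the doubling loop and the cap by available_bytes interact, here is a small Python sketch. The function name and return shape are hypothetical; it models only the two excerpts quoted in this section and does not handle the case where the capped size is still too small for the request.

```python
MIB = 1024 * 1024


def next_region_size(rounded_bytes: int, curr_region_bytes: int, available_bytes: int):
    """Return (bytes_to_request, updated_curr_region_bytes) for one Region request."""
    # Double the region size until the request fits.
    while rounded_bytes > curr_region_bytes:
        curr_region_bytes *= 2
    # Never ask for more than what is believed to be available.
    bytes_to_request = min(curr_region_bytes, available_bytes)
    return bytes_to_request, curr_region_bytes


# A 5 MiB request with a 1 MiB starting size: 1 -> 2 -> 4 -> 8 MiB.
print(next_region_size(5 * MIB, 1 * MIB, 100 * MIB))
# The same request when only 6 MiB is left: the request is capped at 6 MiB.
print(next_region_size(5 * MIB, 1 * MIB, 6 * MIB))
```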

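[Figure 4]

The available_bytes parameter

available_bytes is the remaining memory that can still be allocated. At initialization TensorFlow reads the GPU's available memory, and every later request is subtracted from that figure. In other words, the GPU's free memory is queried only once, when the program starts. If the program runs on a shared GPU platform, free memory keeps changing. Since the available amount was read at startup without actually being claimed, a memory request made during computation may well hit an OOM.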

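2.2 A Region's ChunkHandles

Each Region is divided at 256-byte granularity into an array of ChunkHandles. Each handle is an index into the chunk vector discussed in the earlier section.

2.3 The Region vector

Every request for contiguous memory creates a Region, and the Regions together form the Region vector: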

std::vector<AllocationRegion> regions_;

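How do we find which Region a chunk belongs to? Each Region records its start and end addresses, and each chunk stores its own start address. Comparing the chunk's start address against each Region's address range identifies the Region it belongs to.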
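The address comparison, together with the 256-byte handle slots of section 2.2, can be sketched as follows. The names are hypothetical, the linear scan is only one possible way to search, and computing the slot as offset // 256 is an assumption based on the 256-byte granularity described in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    start: int  # start address of the region
    size: int   # size in bytes

    @property
    def end(self) -> int:
        return self.start + self.size  # one past the last byte

    def handle_slot(self, ptr: int) -> int:
        """Index of the 256-byte slot that covers ptr (assumed layout)."""
        return (ptr - self.start) // 256


def region_for(regions: List[Region], ptr: int) -> Optional[Region]:
    """Return the region whose [start, end) range contains ptr, if any."""
    for region in regions:
        if region.start <= ptr < region.end:
            return region
    return None


regions = [Region(start=0x1000, size=1024), Region(start=0x10000, size=4096)]
hit = region_for(regions, 0x1000 + 512)
print(hit, hit.handle_slot(0x1000 + 512))
```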

3. Bin

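The previous sections covered Region and Chunk. When new memory is requested, though, finding a matching free chunk quickly matters a great deal, and searching the free chunks of every Region would clearly be inefficient. TensorFlow therefore builds a global set of bins on top of the chunks. Each bin manages chunks within a size range, from (2^bin_num) * 256 to (2^(bin_num+1)) * 256 - 1 bytes, where bin_num is the bin's index. Every chunk's size is a multiple of 256 bytes, and the bins only hold free chunks.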
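The size range of each bin implies that the bin index is the floor of log2(size / 256). A short Python sketch of that mapping follows; the function names are hypothetical, and any upper limit on the number of bins is ignored here.

```python
def bin_num_for_size(num_bytes: int) -> int:
    """Index of the bin whose size range contains num_bytes."""
    units = max(num_bytes, 256) // 256   # size in 256-byte units, at least 1
    return units.bit_length() - 1        # floor(log2(units))


def bin_range(bin_num: int):
    """Smallest and largest chunk size (in bytes) that bin_num manages."""
    return (2 ** bin_num) * 256, (2 ** (bin_num + 1)) * 256 - 1


for size in (256, 511, 512, 1023, 1024):
    print(size, "-> bin", bin_num_for_size(size), bin_range(bin_num_for_size(size)))
```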

[Figure 5]

Each Bin holds a set of free chunks:

typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
    FreeChunkSet free_chunks;

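The application requests memory.
The requested size is used to compute which Bin it belongs to.
The free chunk set of that Bin is searched. If nothing fits, progressively larger Bins are searched until free memory is found.
If still nothing is found, memory has to be requested from the driver for real. A block of curr_region_allocation_bytes_ is allocated as a Region, which is also one large chunk. That chunk is inserted as a free chunk into the free chunk set of its Bin, and the search continues.
If a chunk is found, check whether the free chunk is at least twice the size that is needed.
If it is, to avoid wasting memory the large free chunk is split into two chunks: the smaller one goes to the program, and the larger remainder is inserted back into the free chunk set of its Bin.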
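The steps above can be sketched as a toy allocator in Python. Everything here is a simplified assumption rather than TensorFlow's implementation: the class and method names are hypothetical, NUM_BINS is an assumed cap, the Region size is fixed instead of growing, addresses are plain integers instead of driver memory, and freeing is not modeled.

```python
import bisect

NUM_BINS = 21  # assumption: a fixed number of bins for this toy model


def rounded_bytes(n: int) -> int:
    return 256 * ((n + 255) // 256)


def bin_num_for_size(n: int) -> int:
    return min(NUM_BINS - 1, (max(n, 256) // 256).bit_length() - 1)


class ToyBFC:
    def __init__(self, region_bytes: int) -> None:
        # Each bin is a list of free (size, addr) pairs, kept sorted by size.
        self.bins = [[] for _ in range(NUM_BINS)]
        self.region_bytes = region_bytes
        self.next_addr = 0  # fake address space instead of a real driver

    def _insert_free(self, size: int, addr: int) -> None:
        bisect.insort(self.bins[bin_num_for_size(size)], (size, addr))

    def _extend(self) -> None:
        # A new Region enters the bins as one large free chunk.
        self._insert_free(self.region_bytes, self.next_addr)
        self.next_addr += self.region_bytes

    def _find(self, need: int):
        # Start at the bin for this size, then try larger bins.
        for b in range(bin_num_for_size(need), NUM_BINS):
            for i, (size, addr) in enumerate(self.bins[b]):
                if size >= need:
                    del self.bins[b][i]
                    return size, addr
        return None

    def allocate(self, num_bytes: int):
        need = rounded_bytes(num_bytes)
        found = self._find(need)
        if found is None:
            self._extend()
            found = self._find(need)
            if found is None:
                raise MemoryError("request larger than one region")
        size, addr = found
        if size >= 2 * need:
            # Split: keep `need` bytes, return the remainder to its bin.
            self._insert_free(size - need, addr + need)
            size = need
        return addr, size


alloc = ToyBFC(region_bytes=4096)
first = alloc.allocate(300)    # rounds to 512, splits the new region
second = alloc.allocate(3000)  # takes the 3584-byte remainder unsplit
print(first, second)
```

In the second call the free chunk is smaller than twice the request, so it is handed over whole and the extra 512 bytes stay attached to that allocation.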

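4. Merging and splitting chunks

To use memory more efficiently, a large chunk is split, following the splitting strategy described in the previous section. When a chunk is released, TensorFlow tries to merge it with others. The merging rule is that only chunks at adjacent addresses can be merged.

Remember the chunk's prev and next?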

  BFCAllocator::ChunkHandle h_neighbor = c->next;
  new_chunk->next = h_neighbor;

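Chunks produced by a split are adjacent. When a large chunk is split into two, the new chunk's prev points to the other chunk, its next points to the original chunk's neighbor, and that neighbor's prev is updated to point to the new chunk.

When a chunk is released, its prev and next are checked. If the chunk that prev or next points to is not in use, TensorFlow tries to merge with it.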
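The prev/next bookkeeping for splitting and merging can be illustrated with a small Python model. It is a sketch under assumptions: the class, the in_use and valid flags, and the method names are hypothetical, and bins and handle recycling are left out so that only the neighbor links are shown.

```python
from dataclasses import dataclass

INVALID = -1  # stands in for kInvalidChunkHandle


@dataclass
class Chunk:
    ptr: int
    size: int
    in_use: bool = False
    prev: int = INVALID
    next: int = INVALID
    valid: bool = True  # hypothetical flag, cleared once merged away


class ChunkList:
    def __init__(self, region_size: int) -> None:
        self.chunks = [Chunk(ptr=0, size=region_size)]

    def split(self, h: int, num_bytes: int) -> int:
        """Keep num_bytes in chunk h; return the handle of the remainder."""
        c = self.chunks[h]
        new_h = len(self.chunks)
        new = Chunk(ptr=c.ptr + num_bytes, size=c.size - num_bytes,
                    prev=h, next=c.next)
        self.chunks.append(new)
        if c.next != INVALID:
            self.chunks[c.next].prev = new_h  # old neighbor now follows new chunk
        c.next = new_h
        c.size = num_bytes
        return new_h

    def _merge(self, h1: int, h2: int) -> None:
        """Absorb h2 into h1; h2 must directly follow h1 in memory."""
        c1, c2 = self.chunks[h1], self.chunks[h2]
        c1.size += c2.size
        c1.next = c2.next
        if c2.next != INVALID:
            self.chunks[c2.next].prev = h1
        c2.valid = False

    def free(self, h: int) -> int:
        """Release h, merge with free neighbors, return the surviving handle."""
        c = self.chunks[h]
        c.in_use = False
        if c.next != INVALID and not self.chunks[c.next].in_use:
            self._merge(h, c.next)
        if c.prev != INVALID and not self.chunks[c.prev].in_use:
            p = c.prev
            self._merge(p, h)
            h = p
        return h


cl = ChunkList(1024)
a = 0
b = cl.split(a, 256)   # a: 256 bytes, b: 768 bytes
c = cl.split(b, 256)   # b: 256 bytes, c: 512 bytes
cl.chunks[a].in_use = True
cl.chunks[b].in_use = True
print(cl.free(b), cl.chunks[b].size)  # b absorbs its free neighbor c
print(cl.free(a), cl.chunks[a].size)  # a absorbs b, restoring one chunk
```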
https://blog.csdn.net/raintungli/article/details/80135997