Background

In production, once machines have been in service for a while (3+ years), GPUs occasionally drop offline. Back when models were deployed outside containers, we had to run nvidia-smi by hand to find the faulty card and blacklist it, so the scheduler would not place jobs on the unhealthy GPU:

```
$ nvidia-smi
...
|   9  NVIDIA GeForce ...        Off  | 00000000:C4:00.0 Off |                  N/A |
| 41%   21C    P8     6W / 350W       |      0MiB / 24268MiB |      0%      Default |
|                                     |                      |                  N/A |
...

# Blacklist the faulty card by its PCI bus id (1/ENABLE marks it for drain)
$ sudo nvidia-smi drain -p 0000:c4:00.0 -m 1
```
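For reference, here is a minimal, hypothetical Go helper (assuming the stock `--query-gpu`/`--format` flags of nvidia-smi) that maps each GPU index to the PCI bus id that `nvidia-smi drain -p` expects:

```go
// Hypothetical helper: shell out to nvidia-smi and print "index -> PCI bus id",
// so the bus id to pass to `nvidia-smi drain -p` can be looked up quickly.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,pci.bus_id", "--format=csv,noheader").Output()
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		// Each line looks like: "9, 00000000:C4:00.0"
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		fmt.Printf("index %s -> bus %s\n", parts[0], strings.TrimSpace(parts[1]))
	}
}
```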

Volcano's scheduler hands GPUs to pods by index (NVIDIA_VISIBLE_DEVICES=GPU_INDEX). Once a GPU is blacklisted, the indices of the cards behind it all shift, so a pod may fail to work after a restart. If you are using shared GPUs, do not blacklist GPUs this way, otherwise GPU allocation will fail when the container restarts. Reference: Volcano 调度器分配共享GPU流程梳理 (how the Volcano scheduler allocates shared GPUs).
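A toy illustration of that index shift (made-up bus ids, not real NVML output): once one card is drained, every card behind it is re-enumerated, so the same NVIDIA_VISIBLE_DEVICES value points at a different physical card after the pod restarts.

```go
// Toy example with made-up PCI bus ids: draining the card at index 1
// shifts every index behind it.
package main

import "fmt"

func main() {
	busIDs := []string{"C1:00.0", "C2:00.0", "C4:00.0", "C5:00.0"}

	fmt.Println("before drain:")
	for i, bus := range busIDs {
		fmt.Printf("  index %d -> %s\n", i, bus)
	}

	// Drain the card at index 1 (bus C2:00.0); the cards behind it shift down.
	remaining := append(busIDs[:1:1], busIDs[2:]...)
	fmt.Println("after draining C2:00.0:")
	for i, bus := range remaining {
		fmt.Printf("  index %d -> %s\n", i, bus)
	}
	// Index 2 used to be C4:00.0 and is now C5:00.0.
}
```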

Current fix plan

  • The Volcano device plugin reports the indices of unhealthy GPUs in the node annotations, e.g. when indices 0 and 3 go bad:

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"rbd.csi.ceph.com":"dg-gpu-3090-1-55"}'
    io.cilium.network.ipv4-cilium-host: 10.236.1.33
    io.cilium.network.ipv4-health-ip: 10.236.1.152
    io.cilium.network.ipv4-pod-cidr: 10.236.1.0/24
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
    node.alpha.kubernetes.io/ttl: "0"
    # unhealthy gpu list
    volcano.sh/gpu-unhealthy-ids: '0,3'
    # .......
```
  • The Volcano scheduler picks up the updated GPU health of the node and excludes the unhealthy GPUs, so that later jobs are never scheduled onto them.

Solution

    Volcano device plugin

Report the list of faulty GPUs; if all devices are healthy, delete the annotation.
    pkg/plugin/nvidia/kube_interactor.go

```go
func (ki *KubeInteractor) PatchUnhealthyGPUListOnNode(devices []*Device) error {
	var err error
	unhealthyGPUsStr := ""
	unhealthyGPUs := []string{}

	for i := range devices {
		if devices[i].Health == pluginapi.Unhealthy {
			unhealthyGPUs = append(unhealthyGPUs, fmt.Sprintf("%d", devices[i].Index))
		}
	}
	if len(unhealthyGPUs) > 0 {
		unhealthyGPUsStr = strings.Join(unhealthyGPUs, ",")
	}

	err = wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
		var node *v1.Node
		node, err = ki.clientset.CoreV1().Nodes().Get(context.TODO(), ki.nodeName, metav1.GetOptions{})
		if err != nil {
			klog.V(4).Infof("failed to get node %s: %v", ki.nodeName, err)
			return false, nil
		}

		newNode := node.DeepCopy()
		if unhealthyGPUsStr != "" {
			newNode.Annotations[UnhealthyGPUIDs] = unhealthyGPUsStr
		} else {
			// all devices are healthy, delete the annotation
			delete(newNode.Annotations, UnhealthyGPUIDs)
		}

		_, _, err = nodeutil.PatchNodeStatus(ki.clientset.CoreV1(), types.NodeName(ki.nodeName), node, newNode)
		if err != nil {
			klog.V(4).Infof("failed to patch volcano unhealthy gpu list %s: %v", unhealthyGPUsStr, err)
			return false, nil
		}
		return true, nil
	})
	return err
}
```
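A quick way to exercise this annotation-patch flow without a real cluster is a fake clientset. The sketch below is illustrative only: it applies a plain JSON merge patch rather than nodeutil.PatchNodeStatus, and the node name is made up.

```go
// Illustrative only: write the unhealthy-GPU annotation against a fake clientset
// and read it back, confirming the value the scheduler will later parse.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes/fake"
)

const unhealthyGPUIDs = "volcano.sh/gpu-unhealthy-ids" // same key the plugin writes

func main() {
	// Seed the fake API server with a bare node.
	node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-1"}}
	client := fake.NewSimpleClientset(node)

	// Patch in the unhealthy list, as the plugin's patch call effectively does.
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"0,3"}}}`, unhealthyGPUIDs))
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), node.Name,
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// Read it back.
	got, _ := client.CoreV1().Nodes().Get(context.TODO(), node.Name, metav1.GetOptions{})
	fmt.Println(got.Annotations[unhealthyGPUIDs]) // prints: 0,3
}
```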

When an unhealthy event is received, the health state of the virtual devices is left untouched, since changing it could throw off the scheduler's averaging calculation; the unhealthy list is reported straight to the node annotations instead.
    pkg/plugin/nvidia/server.go

```go
// ListAndWatch lists devices and updates that list according to the health status
func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	err := s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
	if err != nil {
		log.Fatalf("failed sending devices %d: %v", len(m.virtualDevices), err)
	}

	for {
		select {
		case <-m.stop:
			return nil
		case d := <-m.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			isChange := false
			if d.Health != pluginapi.Unhealthy {
				isChange = true
			}
			d.Health = pluginapi.Unhealthy
			log.Printf("'%s' device marked unhealthy: %s", m.resourceName, d.ID)
			s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
			if isChange {
				// Report only on the first healthy->unhealthy transition, to avoid
				// patching the node too often and putting pressure on the apiserver.
				m.kubeInteractor.PatchUnhealthyGPUListOnNode(m.physicalDevices)
			}
		}
	}
}
```

Submitted to the community, awaiting feedback: https://github.com/volcano-sh/devices/pull/24

Scheduler changes

The scheduler change is straightforward. Following the Aliyun shared-GPU approach, read the corresponding key from the node annotations, parse the list, and simply drop the unhealthy devices.
pkg/scheduler/api/node_info.go

```go
// getUnhealthyGPUs returns all the unhealthy GPU ids.
func (ni *NodeInfo) getUnhealthyGPUs(node *v1.Node) (unhealthyGPUs []int) {
	unhealthyGPUs = []int{}
	devicesStr, ok := node.Annotations[UnhealthyGPUIDs]
	if !ok {
		return
	}

	idsStr := strings.Split(devicesStr, ",")
	for _, sid := range idsStr {
		id, err := strconv.Atoi(sid)
		if err != nil {
			klog.Warningf("Failed to parse unhealthy gpu id %s due to %v", sid, err)
		} else {
			unhealthyGPUs = append(unhealthyGPUs, id)
		}
	}

	return
}

func (ni *NodeInfo) setNodeGPUInfo(node *v1.Node) {
	// .....
	ni.GPUDevices = make(map[int]*GPUDevice)
	for i := 0; i < int(gpuNumber); i++ {
		ni.GPUDevices[i] = NewGPUDevice(i, memoryPerCard)
	}

	unhealthyGPUs := ni.getUnhealthyGPUs(node)
	for i := range unhealthyGPUs {
		klog.V(4).Infof("delete unhealthy gpu id %d from GPUDevices", unhealthyGPUs[i])
		delete(ni.GPUDevices, unhealthyGPUs[i])
	}
}
```

Submitted to the community, awaiting feedback:
https://github.com/volcano-sh/volcano/pull/2231
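To make the effect of deleting entries from ni.GPUDevices concrete, here is a small self-contained sketch with hypothetical types (not Volcano's real GPUDevice): any allocation loop that only ranges over the map can never pick an index that was removed for being unhealthy.

```go
// Hypothetical types, not Volcano's real GPUDevice: once unhealthy indices are
// deleted from the map, a range-based allocation loop can never return them.
package main

import "fmt"

type gpuDevice struct {
	ID     int
	Memory uint // total memory in MiB
	Used   uint // memory already allocated in MiB
}

// pickGPU returns the id of any device with enough free memory, or -1.
func pickGPU(devices map[int]*gpuDevice, req uint) int {
	for id, d := range devices {
		if d.Memory-d.Used >= req {
			return id
		}
	}
	return -1
}

func main() {
	devices := map[int]*gpuDevice{
		0: {ID: 0, Memory: 24268},
		1: {ID: 1, Memory: 24268},
		3: {ID: 3, Memory: 24268},
	}
	// Indices reported in volcano.sh/gpu-unhealthy-ids, e.g. "0,3".
	for _, bad := range []int{0, 3} {
		delete(devices, bad)
	}
	fmt.Println(pickGPU(devices, 10240)) // only index 1 is left to choose
}
```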

Results