Background

In production, once machines have been in service for a while (3+ years), GPUs occasionally drop offline. Back when models were deployed outside containers, we had to run nvidia-smi by hand to find the faulty card and blacklist it, so the scheduler would not place jobs on the unhealthy GPU:

```
$ nvidia-smi
...
|   9  NVIDIA GeForce ...        Off  | 00000000:C4:00.0 Off |                  N/A |
| 41%   21C    P8     6W / 350W       |      0MiB / 24268MiB |      0%      Default |
|                                     |                      |                  N/A |
...

# Blacklist the faulty card by its PCI bus id (1/ENABLE marks it for drain)
$ sudo nvidia-smi drain -p 0000:c4:00.0 -m 1
```
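For reference, here is a minimal, hypothetical Go helper (assuming the stock `--query-gpu`/`--format` flags of nvidia-smi) that maps each GPU index to the PCI bus id that `nvidia-smi drain -p` expects:

```go
// Hypothetical helper: shell out to nvidia-smi and print "index -> PCI bus id",
// so the bus id to pass to `nvidia-smi drain -p` can be looked up quickly.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=index,pci.bus_id", "--format=csv,noheader").Output()
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		// Each line looks like: "9, 00000000:C4:00.0"
		parts := strings.SplitN(line, ",", 2)
		if len(parts) != 2 {
			continue
		}
		fmt.Printf("index %s -> bus %s\n", parts[0], strings.TrimSpace(parts[1]))
	}
}
```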

Volcano's scheduler hands GPUs to pods by index (NVIDIA_VISIBLE_DEVICES=GPU_INDEX). Once a GPU is blacklisted, the indices of the cards behind it all shift, so a pod may fail to work after a restart. If you are using shared GPUs, do not blacklist GPUs this way, otherwise GPU allocation will fail when the container restarts. Reference: Volcano 调度器分配共享GPU流程梳理 (how the Volcano scheduler allocates shared GPUs).
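A toy illustration of that index shift (made-up bus ids, not real NVML output): once one card is drained, every card behind it is re-enumerated, so the same NVIDIA_VISIBLE_DEVICES value points at a different physical card after the pod restarts.

```go
// Toy example with made-up PCI bus ids: draining the card at index 1
// shifts every index behind it.
package main

import "fmt"

func main() {
	busIDs := []string{"C1:00.0", "C2:00.0", "C4:00.0", "C5:00.0"}

	fmt.Println("before drain:")
	for i, bus := range busIDs {
		fmt.Printf("  index %d -> %s\n", i, bus)
	}

	// Drain the card at index 1 (bus C2:00.0); the cards behind it shift down.
	remaining := append(busIDs[:1:1], busIDs[2:]...)
	fmt.Println("after draining C2:00.0:")
	for i, bus := range remaining {
		fmt.Printf("  index %d -> %s\n", i, bus)
	}
	// Index 2 used to be C4:00.0 and is now C5:00.0.
}
```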

Current fix plan

  • The Volcano device plugin reports the indices of unhealthy GPUs in the node annotations, e.g. when indices 0 and 3 go bad:

```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"rbd.csi.ceph.com":"dg-gpu-3090-1-55"}'
    io.cilium.network.ipv4-cilium-host: 10.236.1.33
    io.cilium.network.ipv4-health-ip: 10.236.1.152
    io.cilium.network.ipv4-pod-cidr: 10.236.1.0/24
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
    node.alpha.kubernetes.io/ttl: "0"
    # unhealthy gpu list
    volcano.sh/gpu-unhealthy-ids: '0,3'
    # .......
```
  • The Volcano scheduler picks up the updated GPU health of the node and excludes the unhealthy GPUs, so that later jobs are never scheduled onto them.

Solution

    Volcano device plugin

Report the list of faulty GPUs; if all devices are healthy, delete the annotation.
    pkg/plugin/nvidia/kube_interactor.go

```go
func (ki *KubeInteractor) PatchUnhealthyGPUListOnNode(devices []*Device) error {
	var err error
	unhealthyGPUsStr := ""
	unhealthyGPUs := []string{}

	for i := range devices {
		if devices[i].Health == pluginapi.Unhealthy {
			unhealthyGPUs = append(unhealthyGPUs, fmt.Sprintf("%d", devices[i].Index))
		}
	}
	if len(unhealthyGPUs) > 0 {
		unhealthyGPUsStr = strings.Join(unhealthyGPUs, ",")
	}

	err = wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
		var node *v1.Node
		node, err = ki.clientset.CoreV1().Nodes().Get(context.TODO(), ki.nodeName, metav1.GetOptions{})
		if err != nil {
			klog.V(4).Infof("failed to get node %s: %v", ki.nodeName, err)
			return false, nil
		}

		newNode := node.DeepCopy()
		if unhealthyGPUsStr != "" {
			newNode.Annotations[UnhealthyGPUIDs] = unhealthyGPUsStr
		} else {
			// all devices are healthy, delete the annotation
			delete(newNode.Annotations, UnhealthyGPUIDs)
		}

		_, _, err = nodeutil.PatchNodeStatus(ki.clientset.CoreV1(), types.NodeName(ki.nodeName), node, newNode)
		if err != nil {
			klog.V(4).Infof("failed to patch volcano unhealthy gpu list %s: %v", unhealthyGPUsStr, err)
			return false, nil
		}
		return true, nil
	})
	return err
}
```
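A quick way to exercise this annotation-patch flow without a real cluster is a fake clientset. The sketch below is illustrative only: it applies a plain JSON merge patch rather than nodeutil.PatchNodeStatus, and the node name is made up.

```go
// Illustrative only: write the unhealthy-GPU annotation against a fake clientset
// and read it back, confirming the value the scheduler will later parse.
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes/fake"
)

const unhealthyGPUIDs = "volcano.sh/gpu-unhealthy-ids" // same key the plugin writes

func main() {
	// Seed the fake API server with a bare node.
	node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "gpu-node-1"}}
	client := fake.NewSimpleClientset(node)

	// Patch in the unhealthy list, as the plugin's patch call effectively does.
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"0,3"}}}`, unhealthyGPUIDs))
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), node.Name,
		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// Read it back.
	got, _ := client.CoreV1().Nodes().Get(context.TODO(), node.Name, metav1.GetOptions{})
	fmt.Println(got.Annotations[unhealthyGPUIDs]) // prints: 0,3
}
```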

When an unhealthy event is received, the health state of the virtual devices is left untouched, since changing it could throw off the scheduler's averaging calculation; the unhealthy list is reported straight to the node annotations instead.
    pkg/plugin/nvidia/server.go

```go
// ListAndWatch lists devices and updates that list according to the health status
func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	err := s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
	if err != nil {
		log.Fatalf("failed sending devices %d: %v", len(m.virtualDevices), err)
	}

	for {
		select {
		case <-m.stop:
			return nil
		case d := <-m.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			isChange := false
			if d.Health != pluginapi.Unhealthy {
				isChange = true
			}
			d.Health = pluginapi.Unhealthy
			log.Printf("'%s' device marked unhealthy: %s", m.resourceName, d.ID)
			s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
			if isChange {
				// Report only on the first healthy->unhealthy transition, to avoid
				// patching the node too often and putting pressure on the apiserver.
				m.kubeInteractor.PatchUnhealthyGPUListOnNode(m.physicalDevices)
			}
		}
	}
}
```

Submitted to the community, awaiting feedback: https://github.com/volcano-sh/devices/pull/24

Scheduler changes

The scheduler change is straightforward. Following the Aliyun shared-GPU approach, read the corresponding key from the node annotations, parse the list, and simply drop the unhealthy devices.
pkg/scheduler/api/node_info.go

```go
// getUnhealthyGPUs returns all the unhealthy GPU ids.
func (ni *NodeInfo) getUnhealthyGPUs(node *v1.Node) (unhealthyGPUs []int) {
	unhealthyGPUs = []int{}
	devicesStr, ok := node.Annotations[UnhealthyGPUIDs]
	if !ok {
		return
	}

	idsStr := strings.Split(devicesStr, ",")
	for _, sid := range idsStr {
		id, err := strconv.Atoi(sid)
		if err != nil {
			klog.Warningf("Failed to parse unhealthy gpu id %s due to %v", sid, err)
		} else {
			unhealthyGPUs = append(unhealthyGPUs, id)
		}
	}

	return
}

func (ni *NodeInfo) setNodeGPUInfo(node *v1.Node) {
	// .....
	ni.GPUDevices = make(map[int]*GPUDevice)
	for i := 0; i < int(gpuNumber); i++ {
		ni.GPUDevices[i] = NewGPUDevice(i, memoryPerCard)
	}

	unhealthyGPUs := ni.getUnhealthyGPUs(node)
	for i := range unhealthyGPUs {
		klog.V(4).Infof("delete unhealthy gpu id %d from GPUDevices", unhealthyGPUs[i])
		delete(ni.GPUDevices, unhealthyGPUs[i])
	}
}
```

Submitted to the community, awaiting feedback:
https://github.com/volcano-sh/volcano/pull/2231
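To make the effect of deleting entries from ni.GPUDevices concrete, here is a small self-contained sketch with hypothetical types (not Volcano's real GPUDevice): any allocation loop that only ranges over the map can never pick an index that was removed for being unhealthy.

```go
// Hypothetical types, not Volcano's real GPUDevice: once unhealthy indices are
// deleted from the map, a range-based allocation loop can never return them.
package main

import "fmt"

type gpuDevice struct {
	ID     int
	Memory uint // total memory in MiB
	Used   uint // memory already allocated in MiB
}

// pickGPU returns the id of any device with enough free memory, or -1.
func pickGPU(devices map[int]*gpuDevice, req uint) int {
	for id, d := range devices {
		if d.Memory-d.Used >= req {
			return id
		}
	}
	return -1
}

func main() {
	devices := map[int]*gpuDevice{
		0: {ID: 0, Memory: 24268},
		1: {ID: 1, Memory: 24268},
		3: {ID: 3, Memory: 24268},
	}
	// Indices reported in volcano.sh/gpu-unhealthy-ids, e.g. "0,3".
	for _, bad := range []int{0, 3} {
		delete(devices, bad)
	}
	fmt.Println(pickGPU(devices, 10240)) // only index 1 is left to choose
}
```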

Results