Background
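In production, once machines have been in service for a while (3+ years), GPUs intermittently drop offline. Previously, when models were deployed without containers, we had to run nvidia-smi by hand to mask the faulty GPU so that the scheduler would not place tasks on it.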
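```
$ nvidia-smi
..
|   9  NVIDIA GeForce ...  Off  | 00000000:C4:00.0 Off |                  N/A |
| 41%   21C    P8     6W / 350W |      0MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
..
# Mask the corresponding GPU by putting it into drain state
# (-m 1 enables drain, -m 0 disables it)
$ sudo nvidia-smi drain -p 0000:c4:00.0 -m 1
```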
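With Volcano, the scheduler decides which GPU a task gets and hands it to the container as NVIDIA_VISIBLE_DEVICES=GPU_INDEX. Masking a GPU changes the index of every GPU after it, so a pod may stop working after it restarts. If you use shared GPUs, it is best not to mask GPUs this way; otherwise GPU allocation can fail when a container restarts. See the article "Volcano 调度器分配共享GPU流程梳理" (a walkthrough of how the Volcano scheduler allocates shared GPUs).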
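To make the index shift concrete, here is a minimal sketch. It assumes that visible GPUs are numbered consecutively in bus order, which is a simplification; the function name `visibleIndices` and all bus IDs except `0000:c4:00.0` are made up for illustration.

```go
package main

import "fmt"

// visibleIndices models index assignment under the assumption that visible
// GPUs are numbered consecutively in bus order. Masked GPUs get no index.
func visibleIndices(busIDs []string, masked map[string]bool) map[string]int {
	indices := make(map[string]int)
	next := 0
	for _, id := range busIDs {
		if masked[id] { // reading from a nil map is safe and yields false
			continue
		}
		indices[id] = next
		next++
	}
	return indices
}

func main() {
	// Hypothetical bus IDs for a four-GPU machine.
	busIDs := []string{"0000:01:00.0", "0000:41:00.0", "0000:81:00.0", "0000:c4:00.0"}

	before := visibleIndices(busIDs, nil)
	after := visibleIndices(busIDs, map[string]bool{"0000:41:00.0": true})

	for _, id := range busIDs {
		if idx, ok := after[id]; ok {
			fmt.Printf("%s: index %d -> %d\n", id, before[id], idx)
		} else {
			fmt.Printf("%s: index %d -> masked\n", id, before[id])
		}
	}
}
```

In this model, a pod that was assigned index 3 before the mask finds no device with that index after a restart, and a pod assigned index 2 silently lands on a different physical card.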
Current approach
The Volcano device plugin can report the indices of unhealthy GPUs in the node annotations. For example, when GPUs 0 and 3 have failed:
```yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"rbd.csi.ceph.com":"dg-gpu-3090-1-55"}'
    io.cilium.network.ipv4-cilium-host: 10.236.1.33
    io.cilium.network.ipv4-health-ip: 10.236.1.152
    io.cilium.network.ipv4-pod-cidr: 10.236.1.0/24
    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/crio/crio.sock
    node.alpha.kubernetes.io/ttl: "0"
    # unhealthy gpu list
    volcano.sh/gpu-unhealthy-ids: '0,3'
    # .......
```
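The Volcano scheduler then picks up the updated GPU state of the node and excludes the unhealthy GPUs, so that subsequent tasks are not scheduled onto them.
Solution
Volcano device plugin
Report the list of faulty GPUs to the node; if all devices are healthy, delete the list.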
pkg/plugin/nvidia/kube_interactor.go
```go
func (ki *KubeInteractor) PatchUnhealthyGPUListOnNode(devices []*Device) error {
	// Collect the indices of all unhealthy devices.
	unhealthyGPUs := []string{}
	for i := range devices {
		if devices[i].Health == pluginapi.Unhealthy {
			unhealthyGPUs = append(unhealthyGPUs, fmt.Sprintf("%d", devices[i].Index))
		}
	}
	// Joining an empty slice yields "", which means every device is healthy.
	unhealthyGPUsStr := strings.Join(unhealthyGPUs, ",")

	// Retry once per second for up to 10 seconds; API errors are logged and retried.
	return wait.PollImmediate(1*time.Second, 10*time.Second, func() (bool, error) {
		node, err := ki.clientset.CoreV1().Nodes().Get(context.TODO(), ki.nodeName, metav1.GetOptions{})
		if err != nil {
			klog.V(4).Infof("failed to get node %s: %v", ki.nodeName, err)
			return false, nil
		}

		newNode := node.DeepCopy()
		if unhealthyGPUsStr != "" {
			// Guard against a nil map; writing to it would panic.
			if newNode.Annotations == nil {
				newNode.Annotations = map[string]string{}
			}
			newNode.Annotations[UnhealthyGPUIDs] = unhealthyGPUsStr
		} else {
			// All devices are healthy: remove the annotation.
			delete(newNode.Annotations, UnhealthyGPUIDs)
		}

		_, _, err = nodeutil.PatchNodeStatus(ki.clientset.CoreV1(), types.NodeName(ki.nodeName), node, newNode)
		if err != nil {
			klog.V(4).Infof("failed to patch volcano unhealthy gpu list %s: %v", unhealthyGPUsStr, err)
			return false, nil
		}
		return true, nil
	})
}
```
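The set-or-delete logic can be tried without a cluster. The following is a minimal standalone sketch, not the plugin's code: the `device` type and both function names are hypothetical stand-ins, and only the annotation key is taken from the node example earlier in this post.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unhealthyGPUIDsKey is the annotation key shown in the node example.
const unhealthyGPUIDsKey = "volcano.sh/gpu-unhealthy-ids"

// device is a hypothetical stand-in for the plugin's Device type.
type device struct {
	Index   int
	Healthy bool
}

// unhealthyList joins the indices of unhealthy devices with commas.
// It returns "" when every device is healthy.
func unhealthyList(devices []device) string {
	ids := []string{}
	for _, d := range devices {
		if !d.Healthy {
			ids = append(ids, strconv.Itoa(d.Index))
		}
	}
	return strings.Join(ids, ",")
}

// applyUnhealthyList sets the annotation when the list is non-empty and
// deletes it otherwise. It returns the (possibly newly allocated) map.
func applyUnhealthyList(annotations map[string]string, devices []device) map[string]string {
	list := unhealthyList(devices)
	if list == "" {
		delete(annotations, unhealthyGPUIDsKey) // no-op on a nil map
		return annotations
	}
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[unhealthyGPUIDsKey] = list
	return annotations
}

func main() {
	devices := []device{{0, false}, {1, true}, {2, true}, {3, false}}
	annotations := applyUnhealthyList(nil, devices)
	fmt.Println(annotations[unhealthyGPUIDsKey])

	// If every device were reported healthy, the key would be removed.
	devices[0].Healthy, devices[3].Healthy = true, true
	annotations = applyUnhealthyList(annotations, devices)
	_, present := annotations[unhealthyGPUIDsKey]
	fmt.Println("annotation present:", present)
}
```

Deleting the key rather than writing an empty string keeps "no annotation" as the single representation of a fully healthy node, so a reader of the annotation never has to parse an empty value.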
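When an unhealthy event is received, the health of the virtual devices is left unchanged, because modifying it could throw off the average the scheduler computes. Instead, the unhealthy list is reported directly to the node annotations.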
pkg/plugin/nvidia/server.go
```go
// ListAndWatch lists devices and update that list according to the health status
func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	err := s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices})
	if err != nil {
		log.Fatalf("failed sending devices %d: %v", len(m.virtualDevices), err)
	}

	for {
		select {
		case <-m.stop:
			return nil
		case d := <-m.health:
			// FIXME: there is no way to recover from the Unhealthy state.
			// Remember whether this event is a healthy -> unhealthy transition.
			isChange := d.Health != pluginapi.Unhealthy
			d.Health = pluginapi.Unhealthy
			log.Printf("'%s' device marked unhealthy: %s", m.resourceName, d.ID)
			// The virtual devices are sent as-is; their health is not modified.
			if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: m.virtualDevices}); err != nil {
				log.Printf("failed sending devices %d: %v", len(m.virtualDevices), err)
			}
			if isChange {
				// Report only on the first healthy -> unhealthy transition, to avoid
				// patching the node repeatedly and adding load on the apiserver.
				if err := m.kubeInteractor.PatchUnhealthyGPUListOnNode(m.physicalDevices); err != nil {
					log.Printf("failed to report unhealthy GPUs: %v", err)
				}
			}
		}
	}
}
```
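The "report only on the first transition" idea can be illustrated in isolation. This is a rough sketch with hypothetical names (`healthTracker`, `onUnhealthy`, the `GPU-a`/`GPU-b` IDs); the counter stands in for the real node patch.

```go
package main

import "fmt"

// healthTracker remembers which devices were already reported unhealthy.
type healthTracker struct {
	unhealthy map[string]bool
	patches   int // number of simulated node patches
}

func newHealthTracker() *healthTracker {
	return &healthTracker{unhealthy: map[string]bool{}}
}

// onUnhealthy handles one health event and reports whether it triggered a
// patch. Repeated events for the same device are ignored.
func (t *healthTracker) onUnhealthy(id string) bool {
	if t.unhealthy[id] {
		return false
	}
	t.unhealthy[id] = true
	t.patches++
	return true
}

func main() {
	t := newHealthTracker()
	// Hypothetical stream of health events with duplicates.
	for _, id := range []string{"GPU-a", "GPU-a", "GPU-b", "GPU-a"} {
		fmt.Printf("%s patched=%v\n", id, t.onUnhealthy(id))
	}
	fmt.Println("total patches:", t.patches)
}
```

With four events for two distinct devices, only two patches are issued, which is the behavior the `isChange` flag is meant to give.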
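This change has been submitted upstream and is awaiting feedback: https://github.com/volcano-sh/devices/pull/24
Scheduler changes
The scheduler change is fairly simple and largely follows the Aliyun shared-GPU implementation: read the corresponding key from the node annotations, parse the list, and exclude the unhealthy devices.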
pkg/scheduler/api/node_info.go
```go
// getUnhealthyGPUs returns all the unhealthy GPU id.
func (ni *NodeInfo) getUnhealthyGPUs(node *v1.Node) (unhealthyGPUs []int) {
	unhealthyGPUs = []int{}
	devicesStr, ok := node.Annotations[UnhealthyGPUIDs]
	if !ok {
		return
	}

	idsStr := strings.Split(devicesStr, ",")
	for _, sid := range idsStr {
		id, err := strconv.Atoi(sid)
		if err != nil {
			klog.Warningf("Failed to parse unhealthy gpu id %s due to %v", sid, err)
		} else {
			unhealthyGPUs = append(unhealthyGPUs, id)
		}
	}
	return
}

func (ni *NodeInfo) setNodeGPUInfo(node *v1.Node) {
	// ...
	ni.GPUDevices = make(map[int]*GPUDevice)
	for i := 0; i < int(gpuNumber); i++ {
		ni.GPUDevices[i] = NewGPUDevice(i, memoryPerCard)
	}

	// Drop every GPU listed as unhealthy so it is never offered to a task.
	unhealthyGPUs := ni.getUnhealthyGPUs(node)
	for i := range unhealthyGPUs {
		klog.V(4).Infof("delete unhealthy gpu id %d from GPUDevices", unhealthyGPUs[i])
		delete(ni.GPUDevices, unhealthyGPUs[i])
	}
}
```
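To experiment with the parsing and exclusion outside the scheduler, a standalone sketch along these lines may help. The names `parseUnhealthyIDs` and `healthyIndices` are hypothetical, and trimming whitespace and skipping empty tokens are additions of this sketch, not behavior of the scheduler code.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseUnhealthyIDs parses a comma-separated annotation value such as "0,3".
// It returns the valid IDs and, separately, the tokens that failed to parse.
func parseUnhealthyIDs(value string) (ids []int, invalid []string) {
	for _, token := range strings.Split(value, ",") {
		token = strings.TrimSpace(token)
		if token == "" {
			continue // tolerate empty values and stray commas
		}
		id, err := strconv.Atoi(token)
		if err != nil {
			invalid = append(invalid, token)
			continue
		}
		ids = append(ids, id)
	}
	return ids, invalid
}

// healthyIndices returns the sorted indices in [0, gpuNumber) that are not
// listed as unhealthy.
func healthyIndices(gpuNumber int, unhealthy []int) []int {
	devices := make(map[int]bool, gpuNumber)
	for i := 0; i < gpuNumber; i++ {
		devices[i] = true
	}
	for _, id := range unhealthy {
		delete(devices, id) // deleting an unknown ID is a no-op
	}
	healthy := make([]int, 0, len(devices))
	for id := range devices {
		healthy = append(healthy, id)
	}
	sort.Ints(healthy)
	return healthy
}

func main() {
	ids, invalid := parseUnhealthyIDs("0, 3,,x")
	fmt.Println("unhealthy:", ids, "invalid:", invalid)
	fmt.Println("schedulable:", healthyIndices(4, ids))
}
```

Note that `strings.Split("", ",")` returns a single empty token, so an annotation with an empty value would produce one parse warning in the scheduler code; the sketch skips such tokens instead.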
This change has also been submitted upstream and is awaiting feedback:
https://github.com/volcano-sh/volcano/pull/2231
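Results
- When a GPU fails, it is automatically excluded from scheduling, which keeps further tasks from landing on the faulty card and reduces manual intervention.
- Once a fault has occurred, there is still a risk that more GPUs on the same machine fail over the following days. When necessary, stop scheduling onto the affected machine and, after its pods have exited, reboot it to resolve the problem.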
References
Device plugin: https://github.com/volcano-sh/devices (Device plugins for Volcano, e.g. GPU)
Scheduler: https://github.com/volcano-sh/volcano