Background
The native Kubernetes HPA mainly supports autoscaling on CPU and memory; wiring up custom metrics is a fairly involved process, and it cannot use time-series database query syntax, so it does not meet the elastic-scaling requirements of our models.
KEDA allows fine-grained autoscaling (including scaling to and from zero) for event-driven Kubernetes workloads. KEDA acts as a Kubernetes metrics server and lets users define autoscaling rules with a dedicated Kubernetes custom resource definition. KEDA supports the mainstream time-series databases, including aggregation and arithmetic in its queries, so we attempt to drive GPU-based scaling through the KEDA framework.
At present the Kubernetes cAdvisor exposes no GPU monitoring. We need to fill in GPU monitoring in a way that also associates each GPU with its Pod.
GPU metrics collection
Installing the metrics collector
The metrics collector is NVIDIA/dcgm-exporter; see the NVIDIA/dcgm-exporter project page for details. It is installed with the Kubernetes YAML below.
dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      nodeSelector:
        classify: nvidia-gpu # run only on nodes labeled nvidia-gpu
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: monitoring
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400
Apply the configuration to Kubernetes
kubectl apply -f ./dcgm-exporter.yaml
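As a quick sanity check (assuming the DaemonSet and Service were created in the monitoring namespace as above), confirm the exporter Pods are running and that GPU metrics are served on port 9400; run the curl in a second terminal while the port-forward is active:
kubectl -n monitoring get pods -l app.kubernetes.io/name=dcgm-exporter
kubectl -n monitoring port-forward service/dcgm-exporter 9400:9400
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL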
Supported metrics
The exporter collects NVIDIA GPU metrics and reports which container and Pod each GPU belongs to, so every metric can be correlated with its Pod and container. For example:
DCGM_FI_DEV_DEC_UTIL{gpu="2",UUID="GPU-09239326-4101-8dd9-272b-32bf5615f996",device="nvidia2",modelName="NVIDIA GeForce RTX 3090",Hostname="dcgm-exporter-b6sgj",container="dlapi",namespace="cloud-gpu",pod="dlapi-5caeac72582e35c560667d21-6141a02e0d828f8e81bee064-595z5d2"} 0
GPU utilization
DCGM_FI_DEV_GPU_UTIL
GPU memory
- GPU memory used: DCGM_FI_DEV_FB_USED
- GPU memory free: DCGM_FI_DEV_FB_FREE
- Memory utilization: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE), as sketched in the query below
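As an illustrative Prometheus query only (the deployment name is the same example used later in this document, and the pod regex is an assumption about your Pod naming), the memory-utilization ratio for all Pods of one deployment could be computed as:
avg(
  DCGM_FI_DEV_FB_USED{pod=~"58ecbfb07c7ae7056a96a3b1-image-common-production-.*"}
  / (DCGM_FI_DEV_FB_USED{pod=~"58ecbfb07c7ae7056a96a3b1-image-common-production-.*"}
     + DCGM_FI_DEV_FB_FREE{pod=~"58ecbfb07c7ae7056a96a3b1-image-common-production-.*"})
)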
GPU temperature
DCGM_FI_DEV_GPU_TEMP
Storing GPU metrics in Prometheus
We currently run the Prometheus Operator, so configuring a ServiceMonitor is all that is needed for the corresponding Prometheus to scrape the metrics. The YAML is as follows:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  name: dcgm-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 60s # scrape interval
    port: metrics # must match the port name defined in the dcgm-exporter Service
  jobLabel: k8s-app # matched by serviceMonitorSelector.matchExpressions of the Prometheus instance created by prometheus-operator
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
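For reference, a minimal sketch of how the Prometheus custom resource created by prometheus-operator could select this ServiceMonitor; the instance name and the selector key/values below are assumptions and must be adjusted to whatever labels your Prometheus instance actually matches on:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s # hypothetical Prometheus instance name
  namespace: monitoring
spec:
  serviceMonitorNamespaceSelector: {} # pick up ServiceMonitors from any namespace
  serviceMonitorSelector:
    matchExpressions:
    - key: app.kubernetes.io/name # assumed label key on the ServiceMonitor
      operator: In
      values:
      - dcgm-exporter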
Autoscaling on GPU utilization
KEDA can scale workloads on metrics pulled from the corresponding Prometheus and supports full Prometheus query expressions. A model Deployment can therefore be scaled elastically on the model's average GPU utilization.
Querying the average GPU utilization of a given model's instances
For example, given a model Deployment named 58ecbfb07c7ae7056a96a3b1-image-common-production, the average GPU utilization across all instances of this Deployment is:
avg(DCGM_FI_DEV_GPU_UTIL{pod=~"58ecbfb07c7ae7056a96a3b1-image-common-production-.*"})
Example GPU autoscaling configuration
Below is a sample ScaledObject that scales the Deployment on its average GPU utilization:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: 58ecbfb07c7ae7056a96a3b1-image-common-production
  namespace: {{ deployment-namespace }}
spec:
  maxReplicaCount: 12 # maximum number of replicas to scale out to
  scaleTargetRef:
    name: 58ecbfb07c7ae7056a96a3b1-image-common-production
    kind: Deployment
    apiVersion: apps/v1
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      metricName: 58ecbfb07c7ae7056a96a3b1-image-common-production-gpu-avg # must be unique per scaling metric
      threshold: '70' # scale out when average GPU utilization exceeds 70
      query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"58ecbfb07c7ae7056a96a3b1-image-common-production-.*"})
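After applying the ScaledObject, KEDA drives the scaling through an HPA it creates for the target (by default named keda-hpa-<scaledobject-name>). A quick way to confirm the object was picked up and the trigger is active:
kubectl -n {{ deployment-namespace }} get scaledobject 58ecbfb07c7ae7056a96a3b1-image-common-production
kubectl -n {{ deployment-namespace }} get hpa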