HPA Dependencies

Normally, the HPA controller retrieves metrics from a set of aggregated APIs (metrics.k8s.io, custom.metrics.k8s.io, and external.metrics.k8s.io). The metrics.k8s.io API is usually provided by Metrics Server, which must be deployed separately; see the metrics-server project for details. Kubernetes can only support Pod HPA after metrics-server has been deployed successfully.

Installing metrics-server

To deploy metrics-server, download the latest high-availability.yaml deployment manifest from GitHub. This guide uses [metrics-server-helm-chart-3.8.2](https://github.com/kubernetes-sigs/metrics-server/releases/tag/metrics-server-helm-chart-3.8.2).

Modifying the Configuration

Because the CA certificate generated when the cluster was deployed does not include the node IPs as SANs, requests that Metrics Server makes to kubelets by IP fail with the error `x509: cannot validate certificate for 172.26.2.131 because it doesn't contain any IP SANs`. We can add the `--kubelet-insecure-tls` flag to skip certificate validation:

```yaml
- args:
  - --cert-dir=/tmp
  # add this flag to skip kubelet certificate validation
  - --kubelet-insecure-tls
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
```

Verifying the Deployment

You can verify the deployment with the following commands:

```shell
$ kubectl get pods -n kube-system -l k8s-app=metrics-server
NAME                              READY   STATUS    RESTARTS   AGE
metrics-server-78647766bb-8rs4j   1/1     Running   0          115m
metrics-server-78647766bb-pkxt9   1/1     Running   0          115m
$ kubectl logs -n kube-system metrics-server-78647766bb-8rs4j
I0303 04:24:38.304408       1 serving.go:342] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0303 04:24:39.238923       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0303 04:24:39.238954       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0303 04:24:39.238982       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0303 04:24:39.239022       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0303 04:24:39.239031       1 configmap_cafile_content.go:201] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0303 04:24:39.239060       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0303 04:24:39.239398       1 dynamic_serving_content.go:131] "Starting controller" name="serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key"
I0303 04:24:39.239897       1 secure_serving.go:266] Serving securely on [::]:4443
I0303 04:24:39.239974       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
W0303 04:24:39.240080       1 shared_informer.go:372] The sharedIndexInformer has started, run more than once is not allowed
I0303 04:24:39.339998       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
I0303 04:24:39.340088       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0303 04:24:39.340158       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
$ kubectl get apiservice | grep metrics
v1beta1.metrics.k8s.io   kube-system/metrics-server   True   9m
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
{"kind":"NodeMetricsList","apiVersion":"metrics.k8s.io/v1beta1","metadata":{},"items":[{"metadata":{"name":"dg-gpu-3090-1-116","creationTimestamp":"2022-03-03T06:22:48Z","labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","classify":"volcano-gpu-share","gpuDeviceType":"GeForce-RTX-3090","gpuDriverVersion":"470.63.01","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"dg-gpu-3090-1-116","kubernetes.io/os":"linux"}},"timestamp":"2022-03-03T06:22:31Z","window":"10.085s","usage":{"cpu":"212655585n","memory":"27810752Ki"}},{"metadata":{"name":"gz-cs-gpu-3-131","creationTimestamp":"2022-03-03T06:22:48Z","labels":{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","classify":"volcano-gpu-share","gpuDeviceType":"GeForce-GTX-1080-Ti","gpuDriverVersion":"470.63.01","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"gz-cs-gpu-3-131","kubernetes.io/os":"linux","node-role.kubernetes.io/control-plane":"","node-role.kubernetes.io/master":"","node.kubernetes.io/exclude-from-external-load-balancers":""}},"timestamp":"2022-03-03T06:22:33Z","window":"10.087s","usage":{"cpu":"1629935462n","memory":"15271200Ki"}}]}
```
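The `cpu` and `memory` quantities in that response use Kubernetes quantity suffixes (`n` for nanocores, `Ki` for kibibytes). As a rough sketch, the two suffixes seen above can be converted like this; the helper names are illustrative, and a real parser would handle the full set of Kubernetes quantity suffixes (`m`, `k`, `Mi`, `Gi`, ...):

```python
import json

# Minimal converters for the two quantity suffixes seen in the output above.
def cpu_cores(quantity: str) -> float:
    """'212655585n' (nanocores) -> cores."""
    assert quantity.endswith("n")
    return int(quantity[:-1]) / 1_000_000_000

def mem_bytes(quantity: str) -> int:
    """'27810752Ki' (kibibytes) -> bytes."""
    assert quantity.endswith("Ki")
    return int(quantity[:-2]) * 1024

# Usage with a fragment of the NodeMetricsList response:
usage = json.loads('{"cpu":"212655585n","memory":"27810752Ki"}')
print(round(cpu_cores(usage["cpu"]), 3))        # 0.213 cores
print(mem_bytes(usage["memory"]) // (1 << 20))  # 27158 MiB
```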

CPU HPA Test Demo

Below we create an nginx Deployment and Service, and add an HPA that scales at 10% CPU utilization.
demo-hpa.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpa-demo
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: docker.io/library/nginx:1.11.3
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: 50Mi
            cpu: 50m
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: hpa-demo
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx
  type: ClusterIP
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo
spec:
  maxReplicas: 5
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo
  targetCPUUtilizationPercentage: 10
```

Note that the 10% CPU utilization target is not calculated against 1000m (a full core); it uses the `resources.requests.cpu` value as the baseline. Since `requests.cpu` above is 50m, 10% utilization is 50m × 10% = 5m, roughly 0.5% CPU in the machine's `top` output. If `requests.cpu` were instead 5000m with the same 10% HPA target, the threshold would be 5000m × 10% = 500m, roughly 50% CPU in `top`.
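The scaling arithmetic can be sketched in Python. The HPA controller's core rule is `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`; the function names below are illustrative, not Kubernetes API identifiers:

```python
import math

# Sketch of the HPA scaling rule:
#   desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
def desired_replicas(current_replicas: int,
                     current_utilization_pct: float,
                     target_utilization_pct: float) -> int:
    return math.ceil(current_replicas * current_utilization_pct
                     / target_utilization_pct)

# With requests.cpu=50m and a 10% target, the per-pod threshold is 5m:
print(50 * 10 // 100)               # 5 (millicores)

# One pod averaging 30% of its request against a 10% target grows to 3 pods:
print(desired_replicas(1, 30, 10))  # 3
```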

Apply it to the cluster:

```shell
$ kubectl apply -f ./demo-hpa.yaml -n default
```

Test Commands

Get the demo-hpa Service address and send it requests:

```shell
$ kubectl get service -n default
NAME       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
hpa-demo   ClusterIP   10.235.50.218   <none>        80/TCP    129m
$ while true; do wget -q -O- http://10.235.50.218; done
```

After the load loop has run for a while, check the HPA status:

```shell
$ kubectl get hpa
NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/hpa-demo   10%/10%   1         5         5          7m17s
$ kubectl describe hpa hpa-demo
Name:              hpa-demo
Namespace:         default
Labels:            <none>
Annotations:       <none>
CreationTimestamp: Thu, 03 Mar 2022 14:38:07 +0800
Reference:         Deployment/hpa-demo
Metrics:           ( current / target )
  resource cpu on pods (as a percentage of request): 0% (0) / 10%
Min replicas:      1
Max replicas:      5
Deployment pods:   3 current / 1 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    SucceededRescale  the HPA controller was able to update the target scale to 1
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from cpu resource utilization (percentage of request)
  ScalingLimited  True    TooFewReplicas    the desired replica count is less than the minimum replica count
Events:
  Type     Reason                        Age   From                       Message
  ----     ------                        ----  ----                       -------
  Warning  FailedGetResourceMetric       17m   horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedComputeMetricsReplicas  17m   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
  Warning  FailedGetResourceMetric       17m   horizontal-pod-autoscaler  failed to get cpu utilization: did not receive metrics for any ready pods
  Warning  FailedComputeMetricsReplicas  17m   horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
  Normal   SuccessfulRescale             15m   horizontal-pod-autoscaler  New size: 3; reason: cpu resource utilization (percentage of request) above target
  Normal   SuccessfulRescale             14m   horizontal-pod-autoscaler  New size: 4; reason: cpu resource utilization (percentage of request) above target
  Normal   SuccessfulRescale             13m   horizontal-pod-autoscaler  New size: 5; reason: cpu resource utilization (percentage of request) above target
  Normal   SuccessfulRescale             20s   horizontal-pod-autoscaler  New size: 3; reason: All metrics below target
  Normal   SuccessfulRescale             5s    horizontal-pod-autoscaler  New size: 1; reason: All metrics below target
```