Core

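The kubelet's preemption module handles preemption on behalf of high-priority (critical) Pods. See: Pod Priority and Preemption | Kubernetes

:::info
If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods so that the higher-priority Pod can be scheduled.
:::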

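When eviction of lower-priority Pods is triggered

Preemption is triggered only when every error in the slice of Predicate failure reasons is an insufficient-resource error.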

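Which Pods are evicted

To minimize the impact, the victims are chosen by a priority algorithm, described below.
Whose impact is minimized first: guaranteed pods > burstable pods > besteffort pods (the three classes are defined by QoS, see Configure Quality of Service for Pods | Kubernetes).
Definition of minimal impact: fewest pods evicted > fewest total requests of pods (evicting fewer Pods is preferred first; after that, a smaller sum of evicted requests is preferred).

:::info
The takeaway: to keep a service stable, give its Pods the Guaranteed QoS class.
:::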

Interface

The interface has a single method. It is called when a Predicate check fails during admission, which is the moment Pod preemption can be triggered.

    // AdmissionFailureHandler is an interface which defines how to deal with a failure to admit a pod.
    // This allows for the graceful handling of pod admission failure.
    type AdmissionFailureHandler interface {
        HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []PredicateFailureReason) ([]PredicateFailureReason, error)
    }

Structure

The module is simple: the struct only has to implement that one interface method.

    // CriticalPodAdmissionHandler is an AdmissionFailureHandler that handles admission failure for Critical Pods.
    // If the ONLY admission failures are due to insufficient resources, then CriticalPodAdmissionHandler evicts pods
    // so that the critical pod can be admitted. For evictions, the CriticalPodAdmissionHandler evicts a set of pods that
    // frees up the required resource requests. The set of pods is designed to minimize impact, and is prioritized according to the ordering:
    // minimal impact for guaranteed pods > minimal impact for burstable pods > minimal impact for besteffort pods.
    // minimal impact is defined as follows: fewest pods evicted > fewest total requests of pods.
    // finding the fewest total requests of pods is considered besteffort.
    type CriticalPodAdmissionHandler struct {
        getPodsFunc eviction.ActivePodsFunc
        killPodFunc eviction.KillPodFunc
        recorder    record.EventRecorder
    }

The comment describes the behavior very clearly; its content is summarized in the Core section above.

HandleAdmissionFailure

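The implementation proceeds as follows:

  1. Check whether the target Pod is a critical (high-priority) Pod. If it is not, return the failure reasons unchanged.
  2. Classify the admission failure reasons: insufficient-resource errors in one group, all other errors in another.
  3. If there are any other errors, skip preemption and return only those non-resource reasons.
  4. Otherwise, evict lower-priority Pods. There are two possible outcomes:
    1. Preemption succeeds and err is nil.
    2. Preemption fails and err is returned.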

    // HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
    // to allow admission of the pod despite its previous failure.
    func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []lifecycle.PredicateFailureReason) ([]lifecycle.PredicateFailureReason, error) {
        if !kubetypes.IsCriticalPod(admitPod) {
            return failureReasons, nil
        }
        // InsufficientResourceError is not a reason to reject a critical pod.
        // Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
        nonResourceReasons := []lifecycle.PredicateFailureReason{}
        resourceReasons := []*admissionRequirement{}
        for _, reason := range failureReasons {
            if r, ok := reason.(*lifecycle.InsufficientResourceError); ok {
                resourceReasons = append(resourceReasons, &admissionRequirement{
                    resourceName: r.ResourceName,
                    quantity:     r.GetInsufficientAmount(),
                })
            } else {
                nonResourceReasons = append(nonResourceReasons, reason)
            }
        }
        if len(nonResourceReasons) > 0 {
            // Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
            return nonResourceReasons, nil
        }
        err := c.evictPodsToFreeRequests(admitPod, admissionRequirementList(resourceReasons))
        // if no error is returned, preemption succeeded and the pod is safe to admit.
        return nil, err
    }

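evictPodsToFreeRequests

The eviction logic looks at which resources are insufficient and tries to reclaim exactly those resources.

  • Preemption is based on requests, not limits.
  • The requests freed by eviction may exceed what the high-priority Pod actually needs.

  1. Select the Pods that have to be evicted.
  2. Kill them one by one with a blocking call that kills the Pod and the containers inside it.
    1. If a kill fails, only this attempt has failed. The kill has already been submitted, and the kubelet retries it in its own syncPod loop, so the code simply moves on to the next Pod.
  3. Update the preemption metrics.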

    // evictPodsToFreeRequests takes a list of insufficient resources, and attempts to free them by evicting pods
    // based on requests. For example, if the only insufficient resource is 200Mb of memory, this function could
    // evict a pod with request=250Mb.
    func (c *CriticalPodAdmissionHandler) evictPodsToFreeRequests(admitPod *v1.Pod, insufficientResources admissionRequirementList) error {
        podsToPreempt, err := getPodsToPreempt(admitPod, c.getPodsFunc(), insufficientResources)
        if err != nil {
            return fmt.Errorf("preemption: error finding a set of pods to preempt: %v", err)
        }
        for _, pod := range podsToPreempt {
            // record that we are evicting the pod
            // (message is defined elsewhere in the package and is not shown in this excerpt)
            c.recorder.Eventf(pod, v1.EventTypeWarning, events.PreemptContainer, message)
            // this is a blocking call and should only return when the pod and its containers are killed.
            klog.V(3).InfoS("Preempting pod to free up resources", "pod", klog.KObj(pod), "podUID", pod.UID, "insufficientResources", insufficientResources)
            err := c.killPodFunc(pod, true, nil, func(status *v1.PodStatus) {
                status.Phase = v1.PodFailed
                status.Reason = events.PreemptContainer
                status.Message = message
            })
            if err != nil {
                klog.ErrorS(err, "Failed to evict pod", "pod", klog.KObj(pod))
                // In future syncPod loops, the kubelet will retry the pod deletion steps that it was stuck on.
                continue
            }
            if len(insufficientResources) > 0 {
                metrics.Preemptions.WithLabelValues(insufficientResources[0].resourceName.String()).Inc()
            } else {
                metrics.Preemptions.WithLabelValues("").Inc()
            }
            klog.InfoS("Pod evicted successfully", "pod", klog.KObj(pod))
        }
        return nil
    }

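getPodsToPreempt

  1. Take all currently active Pods from the kubelet, group them by QoS class, and return three slices, one per class.
  2. Check that evicting every one of these Pods would cover the requirements. If not, return an error.
  3. Assuming all bestEffortPods and burstablePods are evicted, work out which guaranteedPods still have to be evicted.
  4. Assuming all bestEffortPods and the guaranteedPods chosen in the previous step are evicted, work out which burstablePods still have to be evicted.
  5. Assuming the burstablePods and guaranteedPods chosen in the two previous steps are evicted, work out which bestEffortPods have to be evicted.

The order deserves a closer look: why are the guaranteedPods computed first? An example makes it easier to follow.
Suppose high-priority Pod A requests 100M of memory. The burstablePods (50M in total) and bestEffortPods (40M in total) together request only 90M, and the only guaranteed Pod, B, requests 90M. To satisfy A, B has to be evicted no matter what.
Next comes the burstablePods calculation. With B evicted, A is still 10M short, and that gap should clearly be covered by the lower-priority bestEffortPods. In this round the remaining requirement becomes 100M - 90M (B) - 40M (bestEffortPods) = -30M, so no burstablePods need to be evicted.
The last round decides how many bestEffortPods to evict. The remaining requirement is 100M - 90M (B) - 0M (burstablePods) = 10M, so evicting bestEffortPods whose requests add up to at least 10M is enough.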

    // getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
    func getPodsToPreempt(pod *v1.Pod, pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
        bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pod, pods)
        // make sure that pods exist to reclaim the requirements
        unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
        if len(unableToMeetRequirements) > 0 {
            return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
        }
        // find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
        guaranteedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
        if err != nil {
            return nil, err
        }
        // Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
        burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guaranteedToEvict...)...))
        if err != nil {
            return nil, err
        }
        // Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
        bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guaranteedToEvict...)...))
        if err != nil {
            return nil, err
        }
        return append(append(bestEffortToEvict, burstableToEvict...), guaranteedToEvict...), nil
    }

getPodsToPreemptByDistance

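  1. Loop until the required requests are satisfied, that is, until the list of insufficient resources is empty.
  2. In each round, iterate over all candidate Pods and let the algorithm pick the one best suited for eviction.
    1. The algorithm is covered in detail below. The key point is that it is multi-dimensional: when several resources are required (for example CPU and memory), it scores each Pod across all of them together and picks the best overall candidate.
  3. Subtract the chosen Pod's requests from the requirements to get the list of resources that still have to be freed, then continue the loop.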

    // getPodsToPreemptByDistance finds the pods that have pod requests >= admission requirements.
    // Chooses pods that minimize "distance" to the requirements.
    // If more than one pod exists that fulfills the remaining requirements,
    // it chooses the pod that has the "smaller resource request"
    // This method, by repeatedly choosing the pod that fulfills as much of the requirements as possible,
    // attempts to minimize the number of pods returned.
    func getPodsToPreemptByDistance(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
        podsToEvict := []*v1.Pod{}
        // evict pods by shortest distance from remaining requirements, updating requirements every round.
        for len(requirements) > 0 {
            if len(pods) == 0 {
                return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", requirements.toString())
            }
            // all distances must be less than len(requirements), because the max distance for a single requirement is 1
            bestDistance := float64(len(requirements) + 1)
            bestPodIndex := 0
            // Find the pod with the smallest distance from requirements
            // Or, in the case of two equidistant pods, find the pod with "smaller" resource requests.
            for i, pod := range pods {
                dist := requirements.distance(pod)
                if dist < bestDistance || (bestDistance == dist && smallerResourceRequest(pod, pods[bestPodIndex])) {
                    bestDistance = dist
                    bestPodIndex = i
                }
            }
            // subtract the pod from requirements, and transfer the pod from input-pods to pods-to-evicted
            requirements = requirements.subtract(pods[bestPodIndex])
            podsToEvict = append(podsToEvict, pods[bestPodIndex])
            pods[bestPodIndex] = pods[len(pods)-1]
            pods = pods[:len(pods)-1]
        }
        return podsToEvict, nil
    }

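distance

This is the core algorithm behind the eviction order. The final distance is the sum of the per-resource distances, so the algorithm is multi-dimensional. The sum carries no coefficients, so every required resource is treated equally (it does not favor evicting Pods with more CPU, or Pods with more memory).

  1. Score the given Pod as an eviction candidate.
  2. Iterate over all required resources.
    1. Subtract the Pod's request for that resource from the required quantity.
      1. If some of the requirement is still unmet, take the unmet amount as a fraction of the required quantity, square it, and add the result to the distance.
  3. Return the distance, which is the criterion for choosing the best Pod to evict.

:::info
distance describes how far the requirements are from being met if this Pod is evicted. The value is never negative, so 0 is the best case: evicting that one Pod already satisfies the requirements.
:::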

    // distance returns distance of the pods requests from the admissionRequirements.
    // The distance is measured by the fraction of the requirement satisfied by the pod,
    // so that each requirement is weighted equally, regardless of absolute magnitude.
    func (a admissionRequirementList) distance(pod *v1.Pod) float64 {
        dist := float64(0)
        for _, req := range a {
            remainingRequest := float64(req.quantity - resource.GetResourceRequest(pod, req.resourceName))
            if remainingRequest > 0 {
                dist += math.Pow(remainingRequest/float64(req.quantity), 2)
            }
        }
        return dist
    }

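Extension

For commercial practice around resource utilization, the Linux Foundation has set up the FinOps Foundation. If you are interested, take a look at the following projects:

Preemption - Figure 1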