Kubelet - Preemption - Kubernetes Source Code Architecture Diagrams

Core
- When eviction of lower-priority Pods is triggered
- Which Pods are evicted
Interface
Structure
HandleAdmissionFailure
Extension

Core

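The kubelet's preemption module handles preemption on behalf of high-priority Pods. See: Pod Priority and Preemption | Kubernetes :::info If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods so that the high-priority Pod can be scheduled. :::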

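When eviction of lower-priority Pods is triggered

Preemption is triggered only when every error in the Predicate failure slice is caused by insufficient resources.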

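Which Pods are evicted

To minimize impact, the Pods to evict are chosen by a priority algorithm, described below.
Priority of the Pods whose impact is minimized: guaranteed pods > burstable pods > besteffort pods (the three classes are the QoS classes defined in Configure Quality of Service for Pods | Kubernetes)
Definition of minimal impact: fewest pods evicted > fewest total requests of pods (that is, evicting fewer Pods is preferred first, and only then a smaller sum of the evicted Pods' requests) :::info The takeaway: to keep a service stable, run its Pods in the Guaranteed QoS class. :::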
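The "minimal impact" ordering can be sketched as a comparison between two candidate eviction sets within the same QoS class. This is only an illustration: `evictionSet` and `lessImpact` are hypothetical names, and a single memory figure in MiB stands in for real Pod requests.

```go
package main

import "fmt"

// evictionSet is a hypothetical candidate set of pods to evict,
// represented only by each pod's memory request in MiB.
type evictionSet struct {
	podRequests []int64
}

// total sums the requests of every pod in the set.
func (s evictionSet) total() int64 {
	var sum int64
	for _, r := range s.podRequests {
		sum += r
	}
	return sum
}

// String gives a short description for printing.
func (s evictionSet) String() string {
	return fmt.Sprintf("%d pods, %d MiB total", len(s.podRequests), s.total())
}

// lessImpact reports whether evicting a has less impact than evicting b:
// fewer pods wins first, and only on a tie does the smaller total win.
func lessImpact(a, b evictionSet) bool {
	if len(a.podRequests) != len(b.podRequests) {
		return len(a.podRequests) < len(b.podRequests)
	}
	return a.total() < b.total()
}

func main() {
	one := evictionSet{podRequests: []int64{250}}
	two := evictionSet{podRequests: []int64{100, 100}}
	// One pod of 250 MiB beats two pods of 100 MiB each,
	// even though its total request is larger.
	fmt.Println(lessImpact(one, two)) // true
}
```

Note that the count comes first: a set that frees more than strictly necessary is still preferred if it touches fewer Pods.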

Interface

The interface has a single method. It also shows when Pod preemption is triggered: the method is called when a Predicate fails.

// AdmissionFailureHandler is an interface which defines how to deal with a failure to admit a pod.
// This allows for the graceful handling of pod admission failure.
type AdmissionFailureHandler interface {
    HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []PredicateFailureReason) ([]PredicateFailureReason, error)
}
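To make the return contract concrete, here is a minimal sketch using stand-in types rather than the real kubelet packages. `failureReason`, `failureHandler`, `passThrough`, and `admit` are all hypothetical, and the caller logic (admit only when no reasons remain and no error is returned) is an assumption inferred from the comments in the implementation further down.

```go
package main

import "fmt"

// failureReason is a hypothetical stand-in for PredicateFailureReason.
type failureReason string

// failureHandler mirrors the shape of AdmissionFailureHandler with stand-in types.
type failureHandler interface {
	HandleAdmissionFailure(podName string, reasons []failureReason) ([]failureReason, error)
}

// passThrough never preempts anything: it hands every reason back unchanged.
type passThrough struct{}

func (passThrough) HandleAdmissionFailure(_ string, reasons []failureReason) ([]failureReason, error) {
	return reasons, nil
}

// admit is an assumed caller: the pod is admitted only if the handler
// clears every failure reason and reports no error.
func admit(h failureHandler, podName string, reasons []failureReason) error {
	remaining, err := h.HandleAdmissionFailure(podName, reasons)
	if err != nil {
		return fmt.Errorf("preemption failed for %s: %w", podName, err)
	}
	if len(remaining) > 0 {
		return fmt.Errorf("pod %s rejected: %v", podName, remaining)
	}
	return nil
}

func main() {
	fmt.Println(admit(passThrough{}, "web", []failureReason{"insufficient memory"}))
	fmt.Println(admit(passThrough{}, "web", nil))
}
```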

Structure

The module is fairly simple: the struct only needs to implement that one interface method.

// CriticalPodAdmissionHandler is an AdmissionFailureHandler that handles admission failure for Critical Pods.
// If the ONLY admission failures are due to insufficient resources, then CriticalPodAdmissionHandler evicts pods
// so that the critical pod can be admitted.  For evictions, the CriticalPodAdmissionHandler evicts a set of pods that
// frees up the required resource requests.  The set of pods is designed to minimize impact, and is prioritized according to the ordering:
// minimal impact for guaranteed pods > minimal impact for burstable pods > minimal impact for besteffort pods.
// minimal impact is defined as follows: fewest pods evicted > fewest total requests of pods.
// finding the fewest total requests of pods is considered besteffort.
type CriticalPodAdmissionHandler struct {
    getPodsFunc eviction.ActivePodsFunc
    killPodFunc eviction.KillPodFunc
    recorder    record.EventRecorder
}

The comment explains the behavior very clearly; it is already summarized in the Core section above.

HandleAdmissionFailure

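The implementation works as follows:

Check whether the target Pod is a critical (high-priority) Pod; if it is not, return the failure reasons unchanged.
Classify the admission failure reasons: insufficient-resource errors in one group, all other errors in another.
If there are any other errors, skip preemption and return those errors directly.

Otherwise, evict lower-priority Pods. There are two possible outcomes:

Preemption succeeds and err is nil.

Preemption fails and err is returned.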

// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []lifecycle.PredicateFailureReason) ([]lifecycle.PredicateFailureReason, error) {
    if !kubetypes.IsCriticalPod(admitPod) {
        return failureReasons, nil
    }
    // InsufficientResourceError is not a reason to reject a critical pod.
    // Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
    nonResourceReasons := []lifecycle.PredicateFailureReason{}
    resourceReasons := []*admissionRequirement{}
    for _, reason := range failureReasons {
        if r, ok := reason.(*lifecycle.InsufficientResourceError); ok {
            resourceReasons = append(resourceReasons, &admissionRequirement{
                resourceName: r.ResourceName,
                quantity:     r.GetInsufficientAmount(),
            })
        } else {
            nonResourceReasons = append(nonResourceReasons, reason)
        }
    }
    if len(nonResourceReasons) > 0 {
        // Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
        return nonResourceReasons, nil
    }
    err := c.evictPodsToFreeRequests(admitPod, admissionRequirementList(resourceReasons))
    // if no error is returned, preemption succeeded and the pod is safe to admit.
    return nil, err
}

evictPodsToFreeRequests()

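The eviction logic checks which resources are insufficient and tries to free those resources by preemption.

Preemption is based on requests, not limits.
The requests freed by eviction may exceed what the high-priority Pod requires.

Select the Pods to evict.
Kill them one by one with a blocking call that kills the Pod and the containers inside it.
1. If a kill fails, only this attempt has failed; the kill has already been submitted and the kubelet retries it in its own syncPod loop (so on failure the loop simply moves on to the next Pod).

Update the preemption metrics.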

// evictPodsToFreeRequests takes a list of insufficient resources, and attempts to free them by evicting pods
// based on requests.  For example, if the only insufficient resource is 200Mb of memory, this function could
// evict a pod with request=250Mb.
func (c *CriticalPodAdmissionHandler) evictPodsToFreeRequests(admitPod *v1.Pod, insufficientResources admissionRequirementList) error {
 podsToPreempt, err := getPodsToPreempt(admitPod, c.getPodsFunc(), insufficientResources)
 if err != nil {
     return fmt.Errorf("preemption: error finding a set of pods to preempt: %v", err)
 }
 for _, pod := range podsToPreempt {
     // record that we are evicting the pod
     c.recorder.Eventf(pod, v1.EventTypeWarning, events.PreemptContainer, message)
     // this is a blocking call and should only return when the pod and its containers are killed.
     klog.V(3).InfoS("Preempting pod to free up resources", "pod", klog.KObj(pod), "podUID", pod.UID, "insufficientResources", insufficientResources)
     err := c.killPodFunc(pod, true, nil, func(status *v1.PodStatus) {
         status.Phase = v1.PodFailed
         status.Reason = events.PreemptContainer
         status.Message = message
     })
     if err != nil {
         klog.ErrorS(err, "Failed to evict pod", "pod", klog.KObj(pod))
         // In future syncPod loops, the kubelet will retry the pod deletion steps that it was stuck on.
         continue
     }
     if len(insufficientResources) > 0 {
         metrics.Preemptions.WithLabelValues(insufficientResources[0].resourceName.String()).Inc()
     } else {
         metrics.Preemptions.WithLabelValues("").Inc()
     }
     klog.InfoS("Pod evicted successfully", "pod", klog.KObj(pod))
 }
 return nil
}
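The "log the failure and move on" behavior of the loop above can be isolated in a small sketch. `killFunc`, `evictAll`, and `errKill` are hypothetical stand-ins; the real kill call, event recording, and metrics are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errKill is a hypothetical error used to simulate a failed kill.
var errKill = errors.New("runtime timeout")

// killFunc is a hypothetical stand-in for the kubelet's blocking kill call.
type killFunc func(pod string) error

// evictAll tries every pod in order. A failed kill is logged and skipped
// rather than aborting the loop, so later pods are still attempted.
func evictAll(pods []string, kill killFunc) (evicted []string) {
	for _, pod := range pods {
		if err := kill(pod); err != nil {
			fmt.Printf("failed to evict %s: %v\n", pod, err)
			continue
		}
		evicted = append(evicted, pod)
	}
	return evicted
}

func main() {
	kill := func(pod string) error {
		if pod == "b" {
			return errKill
		}
		return nil
	}
	// Pod "b" fails, but "c" is still attempted afterwards.
	fmt.Println(evictAll([]string{"a", "b", "c"}, kill)) // [a c]
}
```

One consequence visible in the original function: because failures are skipped, it returns nil even when some individual kills failed.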

getPodsToPreempt

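Get all currently active Pods from the kubelet, group them by QoS class, and return one slice per QoS class.
Determine which GuaranteedPods must be evicted in the case where evicting all bestEffortPods and burstablePods still cannot satisfy the target Pod's resource requirements.
Determine which burstablePods must be evicted in the case where evicting all bestEffortPods plus the GuaranteedPods selected in the previous step still cannot satisfy the requirements.
Determine which bestEffortPods must be evicted in the case where evicting the burstablePods selected in the previous step plus the GuaranteedPods selected in the step before that still cannot satisfy the requirements.

The order deserves attention: why are guaranteedPods computed first? An example makes it clear:
Suppose high-priority Pod A needs 100M of memory. The burstablePods (50M in total) and bestEffortPods (40M in total) together request only 90M, and guaranteedPods contains a single Pod B that requests 90M. To satisfy A, B has to be evicted no matter what. (The 40M figure is only illustrative: Pods in the BestEffort class declare no CPU or memory requests.)
Next, the burstablePods to evict are computed. With only B evicted, A would still need 10M. That should clearly come from the lower-ranked bestEffortPods, so this pass assumes all of them are evicted as well, and the remaining requirement becomes 100M - 90M (B) - 40M (bestEffortPods) = -30M. No burstablePods need to be evicted.
Finally, the last pass decides how many of the bestEffortPods to evict. Clearly, evicting bestEffortPods whose requests add up to at least 10M is enough.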
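The arithmetic of this example can be checked with a short sketch. It collapses everything to a single resource and treats each QoS class as one lump sum; `remaining` and `checkFeasible` are hypothetical helpers that only imitate what `requirements.subtract` does for one resource.

```go
package main

import "fmt"

// remaining returns how much of need is still unmet after the given
// amounts are freed. It never goes below zero.
func remaining(need int64, freed ...int64) int64 {
	for _, f := range freed {
		need -= f
	}
	if need < 0 {
		return 0
	}
	return need
}

// checkFeasible fails when even evicting every listed amount cannot cover need.
func checkFeasible(need int64, all ...int64) error {
	if r := remaining(need, all...); r > 0 {
		return fmt.Errorf("no set of running pods found to reclaim resources: %d still missing", r)
	}
	return nil
}

func main() {
	// Numbers from the example: A needs 100, besteffort total 40,
	// burstable total 50, guaranteed Pod B requests 90.
	const need, bestEffortTotal, burstableTotal, podB = 100, 40, 50, 90

	if err := checkFeasible(need, bestEffortTotal, burstableTotal, podB); err != nil {
		fmt.Println(err)
		return
	}
	// Pass 1: what guaranteed pods must cover if ALL lower classes are evicted.
	forGuaranteed := remaining(need, bestEffortTotal, burstableTotal)
	// Pass 2: what burstable pods must cover, given all besteffort pods and B.
	forBurstable := remaining(need, bestEffortTotal, podB)
	// Pass 3: what besteffort pods must cover, given B and no burstable pods.
	forBestEffort := remaining(need, 0, podB)

	fmt.Println(forGuaranteed, forBurstable, forBestEffort) // 10 0 10
}
```

The first pass leaves 10 that only B can cover, the second leaves nothing for the burstable class, and the third leaves 10 for the besteffort class, matching the walkthrough.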

// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pod *v1.Pod, pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
    bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pod, pods)
    // make sure that pods exist to reclaim the requirements
    unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
    if len(unableToMeetRequirements) > 0 {
        return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
    }
    // find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
    guaranteedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
    if err != nil {
        return nil, err
    }
    // Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
    burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guaranteedToEvict...)...))
    if err != nil {
        return nil, err
    }
    // Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
    bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guaranteedToEvict...)...))
    if err != nil {
        return nil, err
    }
    return append(append(bestEffortToEvict, burstableToEvict...), guaranteedToEvict...), nil
}

getPodsToPreemptByDistance

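Loop until the required requests are satisfied, that is, until the insufficient resource list is empty.
Iterate over all the Pods and pick the one best suited for eviction according to the algorithm.
1. The algorithm is covered in detail below, but note that it is multi-dimensional: when several resources are required (CPU and memory, for example), both dimensions are summed to find the best Pod to evict.

Subtract the chosen Pod's requests from the requirements to get the list of resources still to be satisfied, then continue the loop.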

// getPodsToPreemptByDistance finds the pods that have pod requests >= admission requirements.
// Chooses pods that minimize "distance" to the requirements.
// If more than one pod exists that fulfills the remaining requirements,
// it chooses the pod that has the "smaller resource request"
// This method, by repeatedly choosing the pod that fulfills as much of the requirements as possible,
// attempts to minimize the number of pods returned.
func getPodsToPreemptByDistance(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
 podsToEvict := []*v1.Pod{}
 // evict pods by shortest distance from remaining requirements, updating requirements every round.
 for len(requirements) > 0 {
     if len(pods) == 0 {
         return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", requirements.toString())
     }
     // all distances must be less than len(requirements), because the max distance for a single requirement is 1
     bestDistance := float64(len(requirements) + 1)
     bestPodIndex := 0
     // Find the pod with the smallest distance from requirements
     // Or, in the case of two equidistant pods, find the pod with "smaller" resource requests.
     for i, pod := range pods {
         dist := requirements.distance(pod)
         if dist < bestDistance || (bestDistance == dist && smallerResourceRequest(pod, pods[bestPodIndex])) {
             bestDistance = dist
             bestPodIndex = i
         }
     }
     // subtract the pod from requirements, and transfer the pod from input-pods to pods-to-evicted
     requirements = requirements.subtract(pods[bestPodIndex])
     podsToEvict = append(podsToEvict, pods[bestPodIndex])
     pods[bestPodIndex] = pods[len(pods)-1]
     pods = pods[:len(pods)-1]
 }
 return podsToEvict, nil
}
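The greedy loop is easier to follow with a single resource and plain integers. The sketch below is a simplification, not the kubelet code: `pickByDistance` and `distance` are hypothetical, each Pod is reduced to one request value, and ties are broken by the smaller value.

```go
package main

import "fmt"

// distance is the squared fraction of need left uncovered by one pod's
// request, or 0 when the request covers need entirely.
func distance(need, request int64) float64 {
	rest := float64(need - request)
	if rest <= 0 {
		return 0
	}
	f := rest / float64(need)
	return f * f
}

// pickByDistance repeatedly picks the request with the smallest distance to
// the remaining need (ties go to the smaller request) until need is covered.
func pickByDistance(requests []int64, need int64) ([]int64, error) {
	pods := append([]int64(nil), requests...) // copy so the caller's slice is untouched
	var picked []int64
	for need > 0 {
		if len(pods) == 0 {
			return nil, fmt.Errorf("no set of pods found to reclaim resources: %d still missing", need)
		}
		best, bestDist := 0, 2.0 // a single-resource distance is at most 1
		for i, r := range pods {
			d := distance(need, r)
			if d < bestDist || (d == bestDist && r < pods[best]) {
				best, bestDist = i, d
			}
		}
		need -= pods[best]
		picked = append(picked, pods[best])
		// Swap-remove the chosen pod from the candidates.
		pods[best] = pods[len(pods)-1]
		pods = pods[:len(pods)-1]
	}
	return picked, nil
}

func main() {
	picked, err := pickByDistance([]int64{30, 60, 20}, 100)
	fmt.Println(picked, err) // [60 30 20] <nil>
}
```

With candidates 25 and 15 and a need of 10, both have distance 0, so the tie-break selects 15: the loop prefers the smallest Pod that still covers what is left.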

distance

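This is the core algorithm for ranking preemption candidates. The final distance is the sum of the distances along each required resource dimension, so the algorithm is multi-dimensional. The sum uses no coefficients, so all required resources are treated equally (CPU-heavy Pods are not evicted in preference to memory-heavy Pods, or the other way round).

It scores how good a candidate the given Pod is for eviction.
Iterate over all required resources.
1. Subtract the candidate Pod's request for the resource from the required amount.
  1. If part of the requirement remains, take the remainder as a fraction of the required amount, square it, and add it to distance.

Return distance as the criterion for choosing the best Pod to evict. :::info distance describes how far from the resource requirement we would still be if this Pod were evicted. Smaller is better, and 0 (the minimum, since only positive remainders are added) means evicting this single Pod satisfies the requirement. :::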

// distance returns distance of the pods requests from the admissionRequirements.
// The distance is measured by the fraction of the requirement satisfied by the pod,
// so that each requirement is weighted equally, regardless of absolute magnitude.
func (a admissionRequirementList) distance(pod *v1.Pod) float64 {
 dist := float64(0)
 for _, req := range a {
     remainingRequest := float64(req.quantity - resource.GetResourceRequest(pod, req.resourceName))
     if remainingRequest > 0 {
         dist += math.Pow(remainingRequest/float64(req.quantity), 2)
     }
 }
 return dist
}

Extension

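For commercial practice around resource utilization, the Linux Foundation has set up the FinOps Foundation. If you are interested, see the following project:

A FinOps implementation: gocrane/crane: Crane (FinOps Crane) is an opensource project which manages cloud resource on Kubernetes stack, it is inspired by FinOps concepts.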

Preemption - Figure 1