Core
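The kubelet's preemption module implements the preemption policy for high-priority Pods. See: Pod Priority and Preemption | Kubernetes
:::info
If a Pod cannot be scheduled, the scheduler tries to preempt (evict) lower-priority Pods so that the higher-priority Pod can be scheduled.
:::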
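When is eviction of lower-priority Pods triggered
Preemption is triggered only when every error in the predicate failure slice is caused by insufficient resources.
Which Pods are evicted
To minimize impact, victims are chosen by the priority algorithm described below.
Order in which impact is minimized: guaranteed pods > burstable pods > besteffort pods (the three classes are defined by QoS -> Configure Quality of Service for Pods | Kubernetes)
Definition of minimal impact: fewest pods evicted > fewest total requests of pods (evicting fewer Pods is better; among equal counts, a smaller sum of evicted Pod requests is better)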
:::info
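The takeaway: to keep a service stable, run its Pods in the Guaranteed QoS class, because Guaranteed Pods are the last ones chosen for preemption.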
:::
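To make the three QoS classes concrete, here is a minimal sketch of how a Pod's class follows from its CPU and memory requests and limits. The `container` type and `qosClass` function are hypothetical simplifications written for this note, not the kubelet's own code, and the sketch assumes requests have already been defaulted (a value of 0 means "not set").

```go
package main

import "fmt"

// container is a hypothetical, simplified stand-in for a container's
// resource settings. Values are plain integers; 0 means "not set".
type container struct {
	cpuRequest, cpuLimit int64
	memRequest, memLimit int64
}

// qosClass approximates the documented QoS rules for CPU and memory only.
// Assumption: requests are already defaulted, so request == 0 means unset.
func qosClass(containers []container) string {
	anySet := false
	allGuaranteed := true
	for _, c := range containers {
		if c.cpuRequest != 0 || c.cpuLimit != 0 || c.memRequest != 0 || c.memLimit != 0 {
			anySet = true
		}
		// Guaranteed needs both limits set and each request equal to its limit.
		if c.cpuLimit == 0 || c.memLimit == 0 ||
			c.cpuRequest != c.cpuLimit || c.memRequest != c.memLimit {
			allGuaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case allGuaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	fmt.Println(qosClass([]container{{500, 500, 256, 256}})) // requests == limits
	fmt.Println(qosClass([]container{{250, 500, 128, 256}})) // requests < limits
	fmt.Println(qosClass([]container{{}}))                   // nothing set
}
```

In practice this means setting requests equal to limits for both CPU and memory on every container of the Pod.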
Interface
The interface has a single method. Pod preemption is triggered when the predicates fail, which is when this method is called.
// AdmissionFailureHandler is an interface which defines how to deal with a failure to admit a pod.
// This allows for the graceful handling of pod admission failure.
type AdmissionFailureHandler interface {
	HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []PredicateFailureReason) ([]PredicateFailureReason, error)
}
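The sketch below shows how a caller might use an interface of this shape. Everything in it is a stand-in defined for illustration: `pod`, `failureReason`, `simpleReason`, `passThroughHandler` and `admit` are hypothetical names, and `admit` only approximates what a predicate admit handler does after the predicates fail.

```go
package main

import "fmt"

// pod and failureReason are hypothetical stand-ins for v1.Pod and
// lifecycle.PredicateFailureReason, so the sketch compiles on its own.
type pod struct{ name string }

type failureReason interface{ GetReason() string }

type simpleReason string

func (r simpleReason) GetReason() string { return string(r) }

// admissionFailureHandler has the same shape as AdmissionFailureHandler.
type admissionFailureHandler interface {
	HandleAdmissionFailure(admitPod *pod, reasons []failureReason) ([]failureReason, error)
}

// passThroughHandler never preempts; it returns the reasons unchanged.
type passThroughHandler struct{}

func (passThroughHandler) HandleAdmissionFailure(_ *pod, reasons []failureReason) ([]failureReason, error) {
	return reasons, nil
}

// admit is an assumed caller: when predicates have produced failure reasons,
// the handler gets one chance to clear them. The pod is admitted only if no
// reasons remain and the handler reported no error.
func admit(h admissionFailureHandler, p *pod, reasons []failureReason) (bool, error) {
	if len(reasons) == 0 {
		return true, nil
	}
	remaining, err := h.HandleAdmissionFailure(p, reasons)
	if err != nil {
		return false, err
	}
	return len(remaining) == 0, nil
}

func main() {
	p := &pod{name: "web"}
	ok, err := admit(passThroughHandler{}, p, []failureReason{simpleReason("OutOfmemory")})
	fmt.Println(ok, err) // the pass-through handler leaves the reason in place
}
```

A preempting handler differs from `passThroughHandler` only in that it may return an empty reason list after freeing resources.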
Structure
The module is simple: the implementation only has to provide that one method.
// CriticalPodAdmissionHandler is an AdmissionFailureHandler that handles admission failure for Critical Pods.
// If the ONLY admission failures are due to insufficient resources, then CriticalPodAdmissionHandler evicts pods
// so that the critical pod can be admitted. For evictions, the CriticalPodAdmissionHandler evicts a set of pods that
// frees up the required resource requests. The set of pods is designed to minimize impact, and is prioritized according to the ordering:
// minimal impact for guaranteed pods > minimal impact for burstable pods > minimal impact for besteffort pods.
// minimal impact is defined as follows: fewest pods evicted > fewest total requests of pods.
// finding the fewest total requests of pods is considered besteffort.
type CriticalPodAdmissionHandler struct {
	getPodsFunc eviction.ActivePodsFunc
	killPodFunc eviction.KillPodFunc
	recorder    record.EventRecorder
}
HandleAdmissionFailure
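The implementation works as follows:
- Check whether the target Pod is a critical Pod (kubetypes.IsCriticalPod); if it is not, return the failure reasons unchanged.
- Split the admission failure reasons into two groups: insufficient-resource errors and everything else.
- If any other reasons exist, skip preemption and return only those reasons.
- Otherwise evict lower-priority Pods to free the missing resources. There are two outcomes:
  - Preemption succeeds and err is nil.
  - Preemption fails and err is returned.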
// HandleAdmissionFailure gracefully handles admission rejection, and, in some cases,
// to allow admission of the pod despite its previous failure.
func (c *CriticalPodAdmissionHandler) HandleAdmissionFailure(admitPod *v1.Pod, failureReasons []lifecycle.PredicateFailureReason) ([]lifecycle.PredicateFailureReason, error) {
	if !kubetypes.IsCriticalPod(admitPod) {
		return failureReasons, nil
	}
	// InsufficientResourceError is not a reason to reject a critical pod.
	// Instead of rejecting, we free up resources to admit it, if no other reasons for rejection exist.
	nonResourceReasons := []lifecycle.PredicateFailureReason{}
	resourceReasons := []*admissionRequirement{}
	for _, reason := range failureReasons {
		if r, ok := reason.(*lifecycle.InsufficientResourceError); ok {
			resourceReasons = append(resourceReasons, &admissionRequirement{
				resourceName: r.ResourceName,
				quantity:     r.GetInsufficientAmount(),
			})
		} else {
			nonResourceReasons = append(nonResourceReasons, reason)
		}
	}
	if len(nonResourceReasons) > 0 {
		// Return only reasons that are not resource related, since critical pods cannot fail admission for resource reasons.
		return nonResourceReasons, nil
	}
	err := c.evictPodsToFreeRequests(admitPod, admissionRequirementList(resourceReasons))
	// if no error is returned, preemption succeeded and the pod is safe to admit.
	return nil, err
}
evictPodsToFreeRequests()
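The eviction logic looks at which resources are insufficient and tries to reclaim exactly those resources.
- Preemption is based on requests, not limits.
- The evicted Pods may free more requests than the critical Pod actually needs.
- Select the set of Pods to evict.
- Kill them one by one with a blocking call that stops the Pod and its containers.
  - If a kill fails, only this attempt failed; the kill has already been submitted, and the kubelet retries it in its own syncPod loop (so the code moves straight on to the next Pod).
- Update the preemption metrics.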
// evictPodsToFreeRequests takes a list of insufficient resources, and attempts to free them by evicting pods
// based on requests. For example, if the only insufficient resource is 200Mb of memory, this function could
// evict a pod with request=250Mb.
func (c *CriticalPodAdmissionHandler) evictPodsToFreeRequests(admitPod *v1.Pod, insufficientResources admissionRequirementList) error {
	podsToPreempt, err := getPodsToPreempt(admitPod, c.getPodsFunc(), insufficientResources)
	if err != nil {
		return fmt.Errorf("preemption: error finding a set of pods to preempt: %v", err)
	}
	for _, pod := range podsToPreempt {
		// record that we are evicting the pod
		c.recorder.Eventf(pod, v1.EventTypeWarning, events.PreemptContainer, message)
		// this is a blocking call and should only return when the pod and its containers are killed.
		klog.V(3).InfoS("Preempting pod to free up resources", "pod", klog.KObj(pod), "podUID", pod.UID, "insufficientResources", insufficientResources)
		err := c.killPodFunc(pod, true, nil, func(status *v1.PodStatus) {
			status.Phase = v1.PodFailed
			status.Reason = events.PreemptContainer
			status.Message = message
		})
		if err != nil {
			klog.ErrorS(err, "Failed to evict pod", "pod", klog.KObj(pod))
			// In future syncPod loops, the kubelet will retry the pod deletion steps that it was stuck on.
			continue
		}
		if len(insufficientResources) > 0 {
			metrics.Preemptions.WithLabelValues(insufficientResources[0].resourceName.String()).Inc()
		} else {
			metrics.Preemptions.WithLabelValues("").Inc()
		}
		klog.InfoS("Pod evicted successfully", "pod", klog.KObj(pod))
	}
	return nil
}
getPodsToPreempt
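Fetch all currently active Pods from the kubelet and group them by QoS class, producing three slices, one per class. After confirming that evicting every candidate would free enough resources, the function runs three passes:
- Compute the guaranteedPods that must be evicted even if all bestEffortPods and burstablePods were evicted.
- Compute the burstablePods that must be evicted if evicting all bestEffortPods plus the guaranteedPods chosen in the previous step still does not satisfy the requirements.
- Compute the bestEffortPods that must be evicted if the burstablePods and guaranteedPods chosen in the previous two steps still do not satisfy the requirements.
The order matters here. Why are guaranteedPods computed first? An example makes it easier to see:
Suppose critical Pod A needs 100M of memory. The burstablePods (50M in total) and bestEffortPods (40M in total) together request only 90M, and guaranteedPods contains a single Pod B that requests 90M. To satisfy A, B has to be evicted no matter what. (The numbers are purely illustrative; real BestEffort Pods declare no CPU or memory requests.)
Next comes the burstablePods pass. After evicting B, A still needs 10M. Clearly that should come from the lower-priority bestEffortPods, and this pass confirms it: the remaining requirement becomes 100M - 90M (B) - 40M (bestEffortPods) = -30M, so no burstablePods need to be evicted.
The last pass decides how many bestEffortPods to evict. The remaining requirement is 100M - 90M (B) = 10M, so evicting bestEffortPods whose requests add up to at least 10M is enough.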
// getPodsToPreempt returns a list of pods that could be preempted to free requests >= requirements
func getPodsToPreempt(pod *v1.Pod, pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
	bestEffortPods, burstablePods, guaranteedPods := sortPodsByQOS(pod, pods)
	// make sure that pods exist to reclaim the requirements
	unableToMeetRequirements := requirements.subtract(append(append(bestEffortPods, burstablePods...), guaranteedPods...)...)
	if len(unableToMeetRequirements) > 0 {
		return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", unableToMeetRequirements.toString())
	}
	// find the guaranteed pods we would need to evict if we already evicted ALL burstable and besteffort pods.
	guaranteedToEvict, err := getPodsToPreemptByDistance(guaranteedPods, requirements.subtract(append(bestEffortPods, burstablePods...)...))
	if err != nil {
		return nil, err
	}
	// Find the burstable pods we would need to evict if we already evicted ALL besteffort pods, and the required guaranteed pods.
	burstableToEvict, err := getPodsToPreemptByDistance(burstablePods, requirements.subtract(append(bestEffortPods, guaranteedToEvict...)...))
	if err != nil {
		return nil, err
	}
	// Find the besteffort pods we would need to evict if we already evicted the required guaranteed and burstable pods.
	bestEffortToEvict, err := getPodsToPreemptByDistance(bestEffortPods, requirements.subtract(append(burstableToEvict, guaranteedToEvict...)...))
	if err != nil {
		return nil, err
	}
	return append(append(bestEffortToEvict, burstableToEvict...), guaranteedToEvict...), nil
}
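The three passes can be replayed with the numbers from the worked example (A needs 100M; burstable 50M, best-effort 40M, guaranteed Pod B 90M). This is a minimal sketch for a single resource, assuming plain integers instead of Pods; `remaining`, `pick`, `concat` and `plan` are hypothetical helpers, and `pick` is a simplified stand-in for getPodsToPreemptByDistance that assumes the feasibility check has already passed.

```go
package main

import "fmt"

// remaining returns how much of need is still missing after evicting pods
// with the given requests; it never goes below 0.
func remaining(need int64, evicted ...int64) int64 {
	for _, r := range evicted {
		need -= r
	}
	if need < 0 {
		return 0
	}
	return need
}

// pick greedily chooses requests from candidates until need is covered:
// prefer the pod that leaves the least uncovered, then the smaller request.
func pick(candidates []int64, need int64) []int64 {
	pool := append([]int64(nil), candidates...)
	var chosen []int64
	for need > 0 && len(pool) > 0 {
		best := 0
		for i, r := range pool {
			li, lb := remaining(need, r), remaining(need, pool[best])
			if li < lb || (li == lb && r < pool[best]) {
				best = i
			}
		}
		chosen = append(chosen, pool[best])
		need = remaining(need, pool[best])
		pool = append(pool[:best], pool[best+1:]...)
	}
	return chosen
}

// concat joins two slices without modifying either input.
func concat(a, b []int64) []int64 {
	out := make([]int64, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...)
}

// plan runs the three passes in the same order as getPodsToPreempt.
func plan(need int64, bestEffort, burstable, guaranteed []int64) (g, b, be []int64) {
	// Pass 1: pretend ALL best-effort and burstable pods are already gone.
	g = pick(guaranteed, remaining(need, concat(bestEffort, burstable)...))
	// Pass 2: pretend ALL best-effort pods and the chosen guaranteed pods are gone.
	b = pick(burstable, remaining(need, concat(bestEffort, g)...))
	// Pass 3: only the pods actually chosen so far are gone.
	be = pick(bestEffort, remaining(need, concat(b, g)...))
	return g, b, be
}

func main() {
	g, b, be := plan(100, []int64{10, 30}, []int64{20, 30}, []int64{90})
	fmt.Println(g, b, be) // [90] [] [10]
}
```

Computing the guaranteed set first under the most pessimistic assumption keeps it as small as possible; the later passes then only fill whatever gap is left.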
getPodsToPreemptByDistance
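- Loop until the requested resources are satisfied, i.e. the list of insufficient resources is empty.
  - Iterate over all candidate Pods and pick the one best suited for eviction according to the algorithm.
    - The algorithm is covered in detail below. The key point is that it is multi-dimensional: when several resources are required (CPU, memory), it scores each Pod across all of them together.
  - Subtract the chosen Pod's requests from the requirements to get the list of resources still needed, then continue the loop.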
// getPodsToPreemptByDistance finds the pods that have pod requests >= admission requirements.
// Chooses pods that minimize "distance" to the requirements.
// If more than one pod exists that fulfills the remaining requirements,
// it chooses the pod that has the "smaller resource request"
// This method, by repeatedly choosing the pod that fulfills as much of the requirements as possible,
// attempts to minimize the number of pods returned.
func getPodsToPreemptByDistance(pods []*v1.Pod, requirements admissionRequirementList) ([]*v1.Pod, error) {
	podsToEvict := []*v1.Pod{}
	// evict pods by shortest distance from remaining requirements, updating requirements every round.
	for len(requirements) > 0 {
		if len(pods) == 0 {
			return nil, fmt.Errorf("no set of running pods found to reclaim resources: %v", requirements.toString())
		}
		// all distances must be less than len(requirements), because the max distance for a single requirement is 1
		bestDistance := float64(len(requirements) + 1)
		bestPodIndex := 0
		// Find the pod with the smallest distance from requirements
		// Or, in the case of two equidistant pods, find the pod with "smaller" resource requests.
		for i, pod := range pods {
			dist := requirements.distance(pod)
			if dist < bestDistance || (bestDistance == dist && smallerResourceRequest(pod, pods[bestPodIndex])) {
				bestDistance = dist
				bestPodIndex = i
			}
		}
		// subtract the pod from requirements, and transfer the pod from input-pods to pods-to-evicted
		requirements = requirements.subtract(pods[bestPodIndex])
		podsToEvict = append(podsToEvict, pods[bestPodIndex])
		pods[bestPodIndex] = pods[len(pods)-1]
		pods = pods[:len(pods)-1]
	}
	return podsToEvict, nil
}
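A standalone sketch of the same greedy loop over two resources may help when tracing it by hand. The types and helpers (`podReq`, `distance`, `subtract`, `smaller`, `selectVictims`) are hypothetical and use plain maps instead of the kubelet's types; the tie-breaker that compares memory before CPU is an assumption made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// podReq is a hypothetical stand-in for a pod: a name plus requests per resource.
type podReq struct {
	name     string
	requests map[string]float64
}

// distance sums the squared uncovered fraction of every still-needed resource.
func distance(need map[string]float64, p podReq) float64 {
	d := 0.0
	for res, q := range need {
		if left := q - p.requests[res]; left > 0 {
			d += (left / q) * (left / q)
		}
	}
	return d
}

// subtract removes p's requests from need and drops satisfied resources.
func subtract(need map[string]float64, p podReq) map[string]float64 {
	out := map[string]float64{}
	for res, q := range need {
		if left := q - p.requests[res]; left > 0 {
			out[res] = left
		}
	}
	return out
}

// smaller is an assumed tie-breaker: compare memory first, then CPU.
func smaller(p, q podReq) bool {
	for _, res := range []string{"memory", "cpu"} {
		if p.requests[res] != q.requests[res] {
			return p.requests[res] < q.requests[res]
		}
	}
	return true
}

// selectVictims repeatedly evicts the pod closest to the remaining need.
func selectVictims(pods []podReq, need map[string]float64) ([]string, error) {
	pool := append([]podReq(nil), pods...)
	var victims []string
	for len(need) > 0 {
		if len(pool) == 0 {
			return nil, errors.New("not enough pods to reclaim the requested resources")
		}
		best, bestDist := 0, float64(len(need)+1)
		for i, p := range pool {
			d := distance(need, p)
			if d < bestDist || (d == bestDist && smaller(p, pool[best])) {
				best, bestDist = i, d
			}
		}
		victims = append(victims, pool[best].name)
		need = subtract(need, pool[best])
		pool[best] = pool[len(pool)-1]
		pool = pool[:len(pool)-1]
	}
	return victims, nil
}

func main() {
	pods := []podReq{
		{"a", map[string]float64{"cpu": 1000, "memory": 100}},
		{"b", map[string]float64{"cpu": 200, "memory": 100}},
		{"c", map[string]float64{"cpu": 100, "memory": 50}},
	}
	victims, err := selectVictims(pods, map[string]float64{"cpu": 1000, "memory": 200})
	fmt.Println(victims, err) // [a b] <nil>
}
```

In the first round Pod a scores 0.25 (CPU fully covered, half the memory missing), which beats b and c; in the second round only 100 of memory is still needed and b covers it exactly.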
distance
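This is the core of the victim-selection algorithm. The final distance is the sum of the per-resource distances, so the algorithm is multi-dimensional. The terms are summed without coefficients, so every required resource is treated equally (it does not prefer evicting Pods with more CPU, nor Pods with more memory).
It scores how good a candidate a given Pod is for eviction.
- Iterate over every required resource.
  - Subtract the candidate Pod's request for that resource from the required quantity.
  - If some requirement remains, take the remainder as a fraction of the required quantity, square it, and add it to distance.
- Return distance as the score; the smaller it is, the better the candidate.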
:::info
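distance describes how far the node still is from the target requirements after evicting this Pod. It is never negative, so 0 is the best value: evicting that single Pod already satisfies the requirements.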
:::
// distance returns distance of the pods requests from the admissionRequirements.
// The distance is measured by the fraction of the requirement satisfied by the pod,
// so that each requirement is weighted equally, regardless of absolute magnitude.
func (a admissionRequirementList) distance(pod *v1.Pod) float64 {
	dist := float64(0)
	for _, req := range a {
		remainingRequest := float64(req.quantity - resource.GetResourceRequest(pod, req.resourceName))
		if remainingRequest > 0 {
			dist += math.Pow(remainingRequest/float64(req.quantity), 2)
		}
	}
	return dist
}
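The distance computed above can also be written compactly. This is only a restatement of the code with symbols chosen here for illustration: $R$ is the set of still-insufficient resources, $q_r$ is the insufficient quantity of resource $r$, and $s_r(p)$ is the request of Pod $p$ for $r$.

```latex
d(p) = \sum_{r \in R} \left( \frac{\max\bigl(0,\; q_r - s_r(p)\bigr)}{q_r} \right)^{2},
\qquad 0 \le d(p) \le |R|
```

For example, with $q_{\text{cpu}} = 1000$ and $q_{\text{mem}} = 200$, a Pod requesting 500 CPU and 200 memory scores $d = (500/1000)^2 + 0 = 0.25$. Because each term is a fraction of its own requirement, a shortfall of half the CPU counts exactly as much as a shortfall of half the memory.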
Extension
For the business side of resource utilization, the Linux Foundation hosts the dedicated FinOps Foundation. If you are interested, see the following projects: