Background

A job declares both a minimum number of instances (minAvailable) and a desired number of instances (replicas). Once the minimum is satisfied, the job can start normally; the replica count is the number of instances the user ultimately wants. When resources are sufficient, the scheduler starts all requested instances. When resources run out, the job is still scheduled at its minimum requirement, as long as that minimum can be met.
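This decision can be sketched as a small function (an illustrative simplification, not Volcano's actual code): start the full replica count when it fits, fall back to anything between the minimum and the desired count when resources are tight, and start nothing when even the minimum cannot be met.

```python
# Illustrative sketch of the elastic start decision; not Volcano's real logic.
def pods_to_start(free: int, min_available: int, replicas: int) -> int:
    """How many pods to launch given `free` schedulable slots."""
    if free < min_available:    # gang constraint unmet: do not start at all
        return 0
    return min(free, replicas)  # otherwise run as many as fit, up to replicas

print(pods_to_start(free=12, min_available=5, replicas=10))  # 10
print(pods_to_start(free=7,  min_available=5, replicas=10))  # 7
print(pods_to_start(free=3,  min_available=5, replicas=10))  # 0
```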

Scheduling within the same queue

For example, queue1 caps CPU usage at 10 cores and sets reclaimable: false, so its resources cannot be reclaimed by other queues. The configuration is as follows:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  weight: 1
  reclaimable: false
  capability:
    cpu: 10  # maximum number of CPUs the queue may use
```

job1 requires a minimum of 5 instances and runs at most 10; each pod uses 1 CPU.
(figure: elastic-scheduler-job1-1.png)

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job1
spec:
  minAvailable: 5
  schedulerName: volcano
  queue: queue1
  tasks:
    - replicas: 10
      name: task1
      template:
        spec:
          containers:
            - image: nginx
              name: job1-workers
              resources:
                requests:
                  cpu: "1"
                limits:
                  cpu: "1"
```

job2 uses the same configuration.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job2
spec:
  schedulerName: volcano
  queue: queue1
  minAvailable: 5
  tasks:
    - replicas: 10
      name: task1
      template:
        spec:
          containers:
            - image: nginx
              name: job2-workers
              resources:
                requests:
                  cpu: "1"
                limits:
                  cpu: "1"
```

Submit job1 and job2 at the same time.

job1 and job2 together ask for a maximum of 20 CPUs, which exceeds queue1's 10-CPU cap, so both jobs are started at their minimum requirement of 5 pods each. queue1 has just enough resources to schedule both jobs at that minimum configuration.

```shell
$ cp job1.yaml job2.yaml minAvailable-same-queue/
$ cd minAvailable-same-queue/
$ kubectl apply -f ./
$ kubectl get pod -n default
NAME           READY   STATUS    RESTARTS   AGE
job1-task1-0   0/1     Pending   0          18m
job1-task1-1   1/1     Running   0          18m
job1-task1-2   1/1     Running   0          18m
job1-task1-3   0/1     Pending   0          18m
job1-task1-4   1/1     Running   0          18m
job1-task1-5   1/1     Running   0          18m
job1-task1-6   1/1     Running   0          18m
job1-task1-7   0/1     Pending   0          18m
job1-task1-8   0/1     Pending   0          18m
job1-task1-9   0/1     Pending   0          18m
...
job2-task1-0   1/1     Running   0          18m
job2-task1-1   0/1     Pending   0          18m
job2-task1-2   1/1     Running   0          18m
job2-task1-3   1/1     Running   0          18m
job2-task1-4   0/1     Pending   0          18m
job2-task1-5   0/1     Pending   0          18m
job2-task1-6   1/1     Running   0          18m
job2-task1-7   0/1     Pending   0          18m
job2-task1-8   1/1     Running   0          18m
job2-task1-9   0/1     Pending   0          18m
```
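The arithmetic behind this result can be checked directly (a simplified model, assuming one CPU per pod):

```python
# Simplified model of the shared-queue allocation shown above.
cap = 10                                       # queue1's CPU capability
jobs = {"job1": (5, 10), "job2": (5, 10)}      # (minAvailable, replicas), 1 CPU/pod
total_max = sum(r for _, r in jobs.values())   # 20 CPUs desired in total
total_min = sum(m for m, _ in jobs.values())   # 10 CPUs strictly required
assert total_max > cap >= total_min            # demand exceeds the cap, minimums fit,
running = {name: m for name, (m, _) in jobs.items()}  # so each job runs at its minimum
print(running)  # {'job1': 5, 'job2': 5}
```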

Increasing the queue's resources

Adjust queue1 to add two more cores and watch how the running pods change.

```shell
$ kubectl edit queue queue1
```

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  creationTimestamp: "2021-12-22T06:57:56Z"
  generation: 1
  name: queue1
  resourceVersion: "43629640"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/queue1
  uid: 44940e23-804e-4bcf-941e-12f07d149c37
spec:
  capability:
    cpu: 12
  reclaimable: false
  weight: 1
status:
  reservation: {}
  running: 2
  state: Open
```

After saving, watch how the pods change:

```shell
$ kubectl get pod
NAME           READY   STATUS              RESTARTS   AGE
job1-task1-0   0/1     Pending             0          28m
job1-task1-1   1/1     Running             0          28m
job1-task1-2   1/1     Running             0          28m
job1-task1-3   0/1     Pending             0          28m
job1-task1-4   1/1     Running             0          28m
job1-task1-5   1/1     Running             0          28m
job1-task1-6   1/1     Running             0          28m
job1-task1-7   0/1     Pending             0          28m
job1-task1-8   0/1     Pending             0          28m
job1-task1-9   0/1     ContainerCreating   0          28m
...
job2-task1-0   1/1     Running             0          28m
job2-task1-1   0/1     Pending             0          28m
job2-task1-2   1/1     Running             0          28m
job2-task1-3   1/1     Running             0          28m
job2-task1-4   0/1     Pending             0          28m
job2-task1-5   0/1     Pending             0          28m
job2-task1-6   1/1     Running             0          28m
job2-task1-7   0/1     ContainerCreating   0          28m
job2-task1-8   1/1     Running             0          28m
job2-task1-9   0/1     Pending             0          28m
```

job1 and job2 each gained one additional elastic pod.
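A back-of-the-envelope check on how the two extra cores were shared (assuming one CPU per pod and an even split between the jobs, which matches what we observed):

```python
cap = 12                    # new queue capability after the edit
guaranteed = 5 + 5          # both jobs' minAvailable pods keep running
spare = cap - guaranteed    # 2 CPUs freed up for elastic pods
extra_per_job = spare // 2  # split across the two jobs
print(extra_per_job)  # 1 additional elastic pod each for job1 and job2
```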

Decreasing the queue's resources

Adjust queue1 again, this time lowering the cap to 8 cores, and watch how the running pods change.

```shell
$ kubectl edit queue queue1
```

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  creationTimestamp: "2021-12-22T06:57:56Z"
  generation: 1
  name: queue1
  resourceVersion: "43629640"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/queue1
  uid: 44940e23-804e-4bcf-941e-12f07d149c37
spec:
  capability:
    cpu: 8
  reclaimable: false
  weight: 1
status:
  reservation: {}
  running: 2
  state: Open
```

After saving, watch how the pods change:

```shell
$ kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
job1-task1-0   0/1     Pending   0          35m
job1-task1-1   1/1     Running   0          35m
job1-task1-2   1/1     Running   0          35m
job1-task1-3   0/1     Pending   0          35m
job1-task1-4   1/1     Running   0          35m
job1-task1-5   1/1     Running   0          35m
job1-task1-6   1/1     Running   0          35m
job1-task1-7   0/1     Pending   0          35m
job1-task1-8   0/1     Pending   0          35m
job1-task1-9   1/1     Running   0          35m
...
job2-task1-0   1/1     Running   0          35m
job2-task1-1   0/1     Pending   0          35m
job2-task1-2   1/1     Running   0          35m
job2-task1-3   1/1     Running   0          35m
job2-task1-4   0/1     Pending   0          35m
job2-task1-5   0/1     Pending   0          35m
job2-task1-6   1/1     Running   0          35m
job2-task1-7   1/1     Running   0          35m
job2-task1-8   1/1     Running   0          35m
job2-task1-9   0/1     Pending   0          35m
```

Note: reducing a queue's resources after pods have started does not evict the running pods.

Scheduling across different queues

Submitting job1, then job2

Submit one job to each of two queues and observe how they contend for resources. This time the scheduling uses GPU resources.

Queue1:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  weight: 1
  reclaimable: true
```

Queue2:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue2
spec:
  weight: 1
  reclaimable: true
```

GPU resources on the system:

```shell
$ nvidia-smi
Thu Dec 23 09:56:28 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...   Off | 00000000:01:00.0 Off |                  N/A |
|  0%   23C    P8    11W / 350W |      2MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...   Off | 00000000:41:00.0 Off |                  N/A |
|  0%   23C    P8    10W / 350W |      2MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...   Off | 00000000:A1:00.0 Off |                  N/A |
|  0%   23C    P8     8W / 350W |      2MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...   Off | 00000000:C1:00.0 Off |                  N/A |
|  0%   23C    P8    11W / 350W |      2MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

$ kubectl describe nodes
...
  volcano.sh/gpu-memory:  97072
  volcano.sh/gpu-number:  4
...
```

There are 4 cards, each with 24268 MiB of memory.

By weight, each queue is entitled to 2 cards, i.e. 2 × 24268 MiB of GPU memory. Because the allocation is dynamic, a queue can be given more than its weighted share while resources are idle. For example, job1 can occupy all 4 cards, with the following YAML:
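The weighted entitlement works out as follows (an illustrative computation; Volcano's actual proportion-based fair share is more involved):

```python
# Split a pool of GPUs across queues in proportion to their weights.
def deserved(total: float, weights: dict) -> dict:
    """Return each queue's weight-proportional share of `total` resources."""
    wsum = sum(weights.values())
    return {q: total * w / wsum for q, w in weights.items()}

print(deserved(4, {"queue1": 1, "queue2": 1}))  # {'queue1': 2.0, 'queue2': 2.0}
```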
job1.yaml:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job1
spec:
  minAvailable: 2
  schedulerName: volcano
  queue: queue1
  tasks:
    - replicas: 4
      name: task1
      template:
        spec:
          containers:
            - image: nginx
              name: job1-workers
              resources:
                requests:
                  cpu: "1"
                limits:
                  cpu: "1"
                  volcano.sh/gpu-memory: "24268"
```

job1 sets minAvailable: 2: the scheduler allocates resources to job1 against this minimum requirement. If the minimum can be satisfied, the pods are scheduled and started; if not, the job cannot run at all and the scheduler moves on to other jobs.

```shell
$ kubectl apply -f job1.yaml
$ kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
job1-task1-0   1/1     Running   0          5m40s
job1-task1-1   1/1     Running   0          5m40s
job1-task1-2   1/1     Running   0          5m40s
job1-task1-3   1/1     Running   0          5m40s
```

Next, start job2, which is scheduled through queue2:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job2
spec:
  schedulerName: volcano
  queue: queue2
  minAvailable: 2
  tasks:
    - replicas: 4
      name: task1
      template:
        spec:
          containers:
            - image: nginx
              name: job2-workers
              resources:
                requests:
                  cpu: "1"
                limits:
                  cpu: "1"
                  volcano.sh/gpu-memory: "24268"
```

Submit job2.yaml:

```shell
$ kubectl apply -f job2.yaml
$ kubectl get pod
NAME           READY   STATUS    RESTARTS   AGE
job1-task1-0   1/1     Running   0          7m36s
job1-task1-1   1/1     Running   0          7m36s
job1-task1-2   1/1     Running   0          7m36s
job1-task1-3   1/1     Running   0          7m36s
job2-task1-0   0/1     Pending   0          6s
job2-task1-1   0/1     Pending   0          6s
job2-task1-2   0/1     Pending   0          6s
job2-task1-3   0/1     Pending   0          6s
$ kubectl describe vcjob job2
...
Warning  PodGroupPending  65s (x3 over 65s)  vc-controller-manager  PodGroup default:job2 unschedule,reason: 2/4 tasks in gang unschedulable: pod group is not ready, 2 minAvailable, 4 Pending; Pending: 1 Unschedulable, 3 Undetermined
...
```
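The gang check that keeps job2 pending can be sketched with the numbers from this example (illustrative only):

```python
free_gpus = 0            # job1's 4 pods hold all 4 cards
min_available = 2        # job2 needs at least 2 pods schedulable together
gang_ready = free_gpus >= min_available
print(gang_ready)  # False: the whole PodGroup waits; no partial start happens
```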

The scheduler does not proactively evict job1's instances to make room for job2, but we can manually delete two of job1's pods, job1-task1-0 and job1-task1-1:

```shell
$ kubectl delete pod job1-task1-0 job1-task1-1
$ kubectl get pod
NAME           READY   STATUS              RESTARTS   AGE
job1-task1-0   0/1     Pending             0          13s
job1-task1-1   0/1     Pending             0          5s
job1-task1-2   1/1     Running             0          117m
job1-task1-3   1/1     Running             0          117m
job2-task1-0   0/1     ContainerCreating   0          110m
job2-task1-1   0/1     Pending             0          110m
job2-task1-2   0/1     ContainerCreating   0          110m
job2-task1-3   0/1     Pending             0          110m
```

Because queue1 was over-committed beyond its weighted share, releasing these two pods frees exactly enough GPU resources to meet job2's minimum, so the scheduler starts two job2 pods.
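A final sanity check on the numbers (each pod requests a full card's memory, 24268 MiB, so pods map one-to-one onto GPUs):

```python
total_gpus = 4
job1_holds = 4 - 2                   # two job1 pods were deleted
free_gpus = total_gpus - job1_holds  # 2 cards released
job2_min = 2                         # job2's minAvailable
assert free_gpus >= job2_min         # the gang constraint is now satisfiable
print(free_gpus)  # 2 free GPUs -> the scheduler starts exactly two job2 pods
```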