
Priority-based Scheduling and Preemption

A pod priority indicates the importance of a pod relative to other pods. Volcano supports pod PriorityClasses in Kubernetes. After PriorityClasses are configured, the scheduler preferentially schedules high-priority pods. When cluster resources are insufficient, the scheduler will proactively evict low-priority pods to make it possible to schedule pending high-priority pods.

Prerequisites

Overview

The services running in a cluster are diversified, including core services, non-core services, online services, and offline services. You can configure priorities for different services based on service importance and SLA requirements. For example, configure a high priority for core services and online services so that such services preferentially obtain cluster resources. When cluster resources are used by non-core services and the remaining resources are insufficient for new core services, the scheduler evicts certain pods of non-core services to release the resources for scheduling the pods of the core services.
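As a minimal sketch of the idea above, the following manifest defines one PriorityClass for core services and one default class for everything else. The names core-service and non-core-default and both values are illustrative assumptions, not CCE settings. In Kubernetes, globalDefault: true applies that class to pods that do not set priorityClassName, and only one PriorityClass in a cluster may set it.

```yaml
# Hypothetical names and values, chosen only for illustration.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: core-service
value: 100000
globalDefault: false
description: "Core and online services."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: non-core-default
value: 1000
globalDefault: true   # applied to pods that do not set priorityClassName
description: "Non-core and offline services."
```

Core workloads would then reference core-service through priorityClassName, while other workloads fall back to the default class without any change to their manifests.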

Table 1 lists the priority-based scheduling supported by CCE clusters.

Table 1 Priority-based scheduling

| Scheduling Type | Description |
| --- | --- |
| Priority-based scheduling | The scheduler preferentially guarantees the running of high-priority pods, but will not evict low-priority pods that are running. Priority-based scheduling is enabled by default and cannot be disabled. |
| Priority-based preemption | When cluster resources are insufficient, the scheduler will proactively evict low-priority pods to make it possible to schedule pending high-priority pods. |

Configuring Priority-based Scheduling and Preemption Policies

After Volcano is installed, you can configure priority-based scheduling and preemption on the Scheduling page.

  1. Log in to the CCE console.
  2. Click the cluster name to access the cluster console. Choose Settings in the navigation pane. In the right pane, click the Scheduling tab.
  3. In the Business priority scheduling area, configure priority-based scheduling.

    • Scheduling based on priority: The scheduler preferentially guarantees the running of high-priority pods, but will not evict low-priority pods that are running. Priority-based scheduling is enabled by default and cannot be disabled.
    • Priority-based Preemption: If Volcano Scheduler is used as the default scheduler of the cluster, priority-based preemption is supported. When cluster resources are insufficient, the scheduler will proactively evict low-priority pods to make it possible to schedule pending high-priority pods.
      Note
      • After priority-based preemption is enabled, delayed pod creation is not allowed.
      • Priority-based preemption is not allowed on custom ENI/sub-ENI resources or host ports.

  4. Click Confirm.
  5. After the configuration, you can use PriorityClasses to schedule the pods of workloads or Volcano jobs based on their priorities.

    1. Create one or more PriorityClasses.
      apiVersion: scheduling.k8s.io/v1
      kind: PriorityClass
      metadata:
        name: high-priority
      value: 1000000
      globalDefault: false
      description: ""
    2. Create a workload or Volcano job and specify its PriorityClass name.
      • Workload
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: high-test
          labels:
            app: high-test
        spec:
          replicas: 5
          selector:
            matchLabels:
              app: test
          template:
            metadata:
              labels:
                app: test
            spec:
              priorityClassName: high-priority
              schedulerName: volcano
              containers:
              - name: test
                image: busybox
                imagePullPolicy: IfNotPresent
                command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
                resources:
                  requests:
                    cpu: 500m
                  limits:
                    cpu: 500m
      • Volcano job
        apiVersion: batch.volcano.sh/v1alpha1
        kind: Job
        metadata:
          name: vcjob
        spec:
          schedulerName: volcano
          minAvailable: 4
          priorityClassName: high-priority
          tasks:
          - replicas: 4
            name: "test"
            template:
              spec:
                containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 1000"]
                  imagePullPolicy: IfNotPresent
                  name: running
                  resources:
                    requests:
                      cpu: "1"
                restartPolicy: OnFailure

Example of Priority-based Scheduling

Assume that the cluster has two idle nodes and three Volcano jobs with high, medium, and low priorities. Run the high-priority job first to exhaust the cluster resources, and then run the medium- and low-priority jobs. Both stay pending because the high-priority job is using all cluster resources. After the high-priority job completes, the medium-priority job is scheduled before the low-priority job.

  1. Add three PriorityClasses (high-priority, med-priority, and low-priority) in priority.yaml.

    Example configuration of priority.yaml:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 100
    globalDefault: false
    description: "This priority class should be used for volcano job only."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: med-priority
    value: 50
    globalDefault: false
    description: "This priority class should be used for volcano job only."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: low-priority
    value: 10
    globalDefault: false
    description: "This priority class should be used for volcano job only."

    Create PriorityClasses.

    kubectl apply -f priority.yaml

  2. Check PriorityClasses.

    kubectl get PriorityClass

    Command output:

    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    high-priority             100          false            97s
    low-priority              10           false            97s
    med-priority              50           false            97s
    system-cluster-critical   2000000000   false            6d6h
    system-node-critical      2000001000   false            6d6h

  3. Create a high-priority Volcano job to exhaust all cluster resources.

    high-priority-job.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-high
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: high-priority
      tasks:
      - replicas: 4
        name: "test"
        template:
          spec:
            containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                requests:
                  cpu: "1"
            restartPolicy: OnFailure

    Run the following command to issue the job:

    kubectl apply -f high-priority-job.yaml

    Run the kubectl get pod command to check pod statuses:

    NAME                   READY   STATUS    RESTARTS   AGE
    priority-high-test-0   1/1     Running   0          3s
    priority-high-test-1   1/1     Running   0          3s
    priority-high-test-2   1/1     Running   0          3s
    priority-high-test-3   1/1     Running   0          3s

    The command output shows that all cluster resources have been used up.

  4. Create a medium-priority and a low-priority Volcano job.

    med-priority-job.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-medium
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: med-priority
      tasks:
      - replicas: 4
        name: "test"
        template:
          spec:
            containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                requests:
                  cpu: "1"
            restartPolicy: OnFailure

    low-priority-job.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-low
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: low-priority
      tasks:
      - replicas: 4
        name: "test"
        template:
          spec:
            containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: running
              resources:
                requests:
                  cpu: "1"
            restartPolicy: OnFailure

    Run the following commands to issue the jobs:

    kubectl apply -f med-priority-job.yaml
    kubectl apply -f low-priority-job.yaml

    Run the kubectl get pod command to check the statuses of the pods for the newly created workloads. The command output shows that the pods are pending due to insufficient resources:

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-high-test-0     1/1     Running   0          3m29s
    priority-high-test-1     1/1     Running   0          3m29s
    priority-high-test-2     1/1     Running   0          3m29s
    priority-high-test-3     1/1     Running   0          3m29s
    priority-low-test-0      0/1     Pending   0          2m26s
    priority-low-test-1      0/1     Pending   0          2m26s
    priority-low-test-2      0/1     Pending   0          2m26s
    priority-low-test-3      0/1     Pending   0          2m26s
    priority-medium-test-0   0/1     Pending   0          2m36s
    priority-medium-test-1   0/1     Pending   0          2m36s
    priority-medium-test-2   0/1     Pending   0          2m36s
    priority-medium-test-3   0/1     Pending   0          2m36s

  5. Delete the high-priority job to release cluster resources. The medium-priority job will be scheduled next.

    Run the kubectl delete -f high-priority-job.yaml command to release cluster resources and check pod scheduling.

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-low-test-0      0/1     Pending   0          5m18s
    priority-low-test-1      0/1     Pending   0          5m18s
    priority-low-test-2      0/1     Pending   0          5m18s
    priority-low-test-3      0/1     Pending   0          5m18s
    priority-medium-test-0   1/1     Running   0          5m28s
    priority-medium-test-1   1/1     Running   0          5m28s
    priority-medium-test-2   1/1     Running   0          5m28s
    priority-medium-test-3   1/1     Running   0          5m28s

Example of Priority-based Preemption

  1. Log in to the CCE console and click the cluster name to access the cluster console. Choose Settings in the navigation pane. In the right pane, click the Scheduling tab.
  2. Modify configurations.

    1. Select Volcano scheduler as the default cluster scheduler.
    2. Enable Priority-based Preemption.

  3. Create the high-priority job (high-priority-job.yaml) from the priority-based scheduling example again. The scheduler evicts the pods of the medium-priority job (priority-medium) so that the pods of the high-priority job can be scheduled.

    Run the kubectl apply -f high-priority-job.yaml command to create the high-priority job. Then, check pod statuses.

    NAME                     READY   STATUS        RESTARTS   AGE
    priority-high-test-0     0/1     Pending       0          2s
    priority-high-test-1     0/1     Pending       0          2s
    priority-high-test-2     0/1     Pending       0          2s
    priority-high-test-3     0/1     Pending       0          2s
    priority-low-test-0      0/1     Pending       0          14s
    priority-low-test-1      0/1     Pending       0          14s
    priority-low-test-2      0/1     Pending       0          14s
    priority-low-test-3      0/1     Pending       0          14s
    priority-medium-test-0   1/1     Terminating   0          21s
    priority-medium-test-1   1/1     Terminating   0          21s
    priority-medium-test-2   1/1     Terminating   0          21s
    priority-medium-test-3   1/1     Terminating   0          21s

    After the resources used by the medium-priority job are released, the pods of the high-priority job can be scheduled.

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-high-test-0     1/1     Running   0          70s
    priority-high-test-1     1/1     Running   0          70s
    priority-high-test-2     1/1     Running   0          70s
    priority-high-test-3     1/1     Running   0          70s
    priority-low-test-0      0/1     Pending   0          82s
    priority-low-test-1      0/1     Pending   0          82s
    priority-low-test-2      0/1     Pending   0          82s
    priority-low-test-3      0/1     Pending   0          82s
    priority-medium-test-0   0/1     Pending   0          37s
    priority-medium-test-1   0/1     Pending   0          36s
    priority-medium-test-2   0/1     Pending   0          37s
    priority-medium-test-3   0/1     Pending   0          37s

    When node resources cannot meet the requirements of the high-priority job, volcano-scheduler triggers priority-based preemption and evicts the pods of the medium-priority job so that the high-priority job can be deployed. After new nodes are added by Cluster Autoscaler, volcano-scheduler schedules the pods of the medium-priority job to the new nodes.

    As the preceding test results show, evicted pods stay pending until more capacity is available. If priority-based preemption is enabled, also enable node scaling so that cluster resources can be allocated on demand to ensure service SLA.

Example of Affinity and Anti-affinity for Priority-based Preemption

Do not configure inter-pod affinity toward pods with lower priorities. If a pending pod has inter-pod affinity with one or more lower-priority pods on a node, evicting those lower-priority pods through preemption would break the affinity rule, so the preemption rule conflicts with the affinity rule. In this case, the scheduler cannot ensure the scheduling of the pending pod. To resolve this issue, configure inter-pod affinity only toward pods with the same or a higher priority. For details, see Figure 1.

With inter-pod affinity, if priority-based preemption is enabled and deploy1 has affinity with the lower-priority deploy2, volcano-scheduler evicts deploy3 and schedules deploy1 to the node to ensure service O&M. The evicted deploy3 is scheduled to a new node after that node is ready.

Figure 1 Inter-pod affinity on lower-priority pods
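The recommendation above can be sketched as a Deployment whose affinity target has the same priority. This is only an illustration: the workloads web and cache and their app labels are hypothetical, and the sketch assumes that the pods labeled app: cache also run with the high-priority PriorityClass. The affinity fields are the standard Kubernetes podAffinity fields.

```yaml
# Sketch only: "web" and "cache" are hypothetical workloads.
# The affinity target (app: cache) is assumed to use high-priority as well,
# so preemption never needs to evict the pods that web depends on.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      priorityClassName: high-priority
      schedulerName: volcano
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: cache
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: busybox
        command: ['sh', '-c', 'sleep 3600']
        resources:
          requests:
            cpu: 500m
```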


With inter-pod anti-affinity, if priority-based preemption is enabled and deploy1 has anti-affinity with deploy2 and deploy3, volcano-scheduler does not evict deploy2 or deploy3, which minimizes the impact on other services. Instead, the scheduler schedules deploy1 to a new node.

Figure 2 Inter-pod anti-affinity on lower-priority pods
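A minimal sketch of how deploy1 in the anti-affinity scenario might be declared. The workload names come from the description above; the app: deploy2 and app: deploy3 labels, the container, and the resource request are assumptions made for illustration.

```yaml
# Sketch only: labels, image, and CPU request are assumed values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deploy1
  template:
    metadata:
      labels:
        app: deploy1
    spec:
      priorityClassName: high-priority
      schedulerName: volcano
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          # Never share a node with pods labeled app=deploy2 or app=deploy3.
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["deploy2", "deploy3"]
            topologyKey: kubernetes.io/hostname
      containers:
      - name: deploy1
        image: busybox
        command: ['sh', '-c', 'sleep 3600']
        resources:
          requests:
            cpu: 500m
```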