nav-img
Advanced

CCE AI Suite (NVIDIA GPU)

Add-on Overview

CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in a cluster, this add-on must be installed.

Add-on Parameters

Table 1 Parameters

Parameter

Mandatory

Type

Description

basic

Yes

object

Basic add-on configuration parameters

custom

Yes

Table 3 object

Custom parameters

Table 2 Configuration of basic

Parameter

Mandatory

Type

Description

cluster_version

No

String

CCE cluster version

device_version

Yes

String

Add-on version

driver_version

Yes

String

Image tag of an add-on pod where a driver is installed. Generally, the value is the same as that of device_version.

obs_url

Yes

String

When a GPU driver is downloaded from the default driver address, the value is the GPU driver address.

swr_addr

Yes

String

Image repository address

swr_user

Yes

String

Tenant path of an image repository

Table 3 Configuration of custom

Parameter

Mandatory

Type

Description

compatible_with_legacy_api

No

Bool

API compatibility switch

Default value: false

true: The add-on supports the GPU native mode and xGPU virtualization.

component_schedulername

Yes

String

Name of the scheduler used by the add-on.

Default value: default-scheduler

disable_mount_path_v1

No

Bool

Default value: false

true: /opt/cloud/cce/nvidia is not mounted to the /usr/lib/nvidia directory of a GPU container.

disable_nvidia_gsp

No

Bool

Default value: true

true: The GPU GSP firmware is disabled.

driver_mount_paths

No

String

Driver file directory that needs to be automatically mounted to a GPU container

Default value: "bin,lib64"

enable_fault_isolation

No

Bool

Default value: true

true: The add-on detects hardware faults or driver issues of a GPU and then sets the GPU to be unavailable.

enable_health_monitoring

No

Bool

Default value: true

true: The add-on detects hardware faults or driver issues of a GPU.

enable_metrics_monitoring

No

Bool

Default value: true

true: The add-on collects GPU metrics and reports these metrics to Prometheus.

enable_simple_lib64_mount

No

Bool

Default value: true

true: Only the libxxx.so.x file is mounted to a container.

enable_xgpu

No

Bool

Default value: false

Whether to enable xGPU virtualization.

gpu_driver_config

No

Map

Configurations of the GPU driver for a single node pool

Default value: {}

health_check_xids_v2

No

String

GPU error range for the add-on health checks

Default value: "74,79"

inject_ld_Library_path

No

String

Value of the LD_LIBRARY_PATH environment variable automatically injected by the add-on to a GPU container

Default value: ""

lib64_container_paths

No

String

Mount path of NVIDIA lib64 in a GPU container

Default value: "/usr/lib64,/usr/lib/x86_64-linux-gnu"

metrics_delete_interval

No

int

Timeout threshold for deleting a metric when the metric cannot be obtained. The unit is millisecond.

Default value: 30000

metrics_monitor_interval

No

int

Interval for obtaining metrics, in milliseconds.

Default value: 15000

nvidia_driver_download_url

Yes

String

Path for downloading the NVIDIA driver

Default value: ""

Example Request

{
"kind": "Addon",
"apiVersion": "v3",
"metadata": {
"name": "gpu-beta",
},
"spec": {
"clusterID": "80c9e306-***-***-***-0255ac100043",
"version": "2.0.69",
"addonTemplateName": "gpu-beta",
"values": {
"basic": {
"cluster_version": "v1.27",
"device_version": "2.0.69",
"driver_version": "2.0.69",
"obs_url": "***",
"region": "***",
"swr_addr": "***",
"swr_user": "***"
},
"custom": {
"compatible_with_legacy_api": true,
"component_schedulername": "kube-scheduler",
"disable_mount_path_v1": false,
"disable_nvidia_gsp": true,
"driver_mount_paths": "bin,lib64",
"enable_fault_isolation": true,
"enable_health_monitoring": true,
"enable_metrics_monitoring": true,
"enable_simple_lib64_mount": true,
"enable_xgpu": true,
"gpu_driver_config": {},
"health_check_xids_v2": "74,79",
"inject_ld_Library_path": "",
"lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
"metrics_delete_interval": 30000,
"metrics_monitor_interval": 15000,
"nvidia_driver_download_url": ""
},
}
}
}