GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install add-on version 2.7.32 or later. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).
GPU Metrics Provided by CCE
Category | Metric | Type | Unit | Monitoring Level | Description |
---|---|---|---|---|---|
Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage |
Utilization | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage |
Utilization | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoding usage |
Utilization | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoding usage |
Utilization | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process |
Utilization | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process |
Utilization | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoding usage of each process |
Utilization | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoding usage of each process |
Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory |
Memory | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory |
Memory | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory |
Memory | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory |
Memory | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory |
Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency |
Frequency | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency |
Frequency | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency |
Frequency | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor clock frequency |
Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature |
Physical status | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power |
Physical status | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption |
Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit | GPU cards | GPU PCIe bandwidth |
Bandwidth | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth |
Bandwidth | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX bandwidth |
Bandwidth | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX bandwidth |
Bandwidth | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX bandwidth |
Bandwidth | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX bandwidth |
Retired memory pages | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of GPU memory pages retired (isolated) due to single-bit errors |
Retired memory pages | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of GPU memory pages retired (isolated) due to double-bit errors |
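The units in the table above (Byte, Milliwatt) often need converting before they are displayed. The following is a minimal sketch, assuming the metrics are scraped in Prometheus text format; the `gpu_index` label and the sample values are hypothetical and only illustrate the arithmetic.

```python
def parse_samples(text):
    """Parse 'name{labels} value' lines into {(name, labels): float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


# Hypothetical scrape output: the label name and all values are made up.
SCRAPE = """\
cce_gpu_memory_used{gpu_index="0"} 8589934592
cce_gpu_memory_total{gpu_index="0"} 17179869184
cce_gpu_power_usage{gpu_index="0"} 152000
"""


def summarize(samples, labels):
    """Convert raw Byte and Milliwatt readings into friendlier units."""
    used = samples[("cce_gpu_memory_used", labels)]
    total = samples[("cce_gpu_memory_total", labels)]
    power_mw = samples[("cce_gpu_power_usage", labels)]
    return {
        "memory_used_gib": used / 2**30,
        "memory_used_percent": 100.0 * used / total,
        "power_watts": power_mw / 1000.0,
    }


summary = summarize(parse_samples(SCRAPE), 'gpu_index="0"')
print(summary)
```

In a real cluster the same conversions would more likely be written as queries in the monitoring system; the parser here only keeps the example self-contained.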
GPU Metrics Provided by DCGM
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization: the percentage of time in a sampling period (1s or 1/6s, depending on the GPU model) during which one or more kernel functions were active. This metric shows only whether kernel functions were using the GPU, not how heavily they used it. |
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | GPU memory bandwidth utilization of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%. |
DCGM_FI_DEV_ENC_UTIL | Gauge | % | GPU encoder utilization of a measured object |
DCGM_FI_DEV_DEC_UTIL | Gauge | % | GPU decoder utilization of a measured object |
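The memory bandwidth example in the description of DCGM_FI_DEV_MEM_COPY_UTIL can be checked with a short calculation. This is an illustrative sketch only; the function names are hypothetical, and the peak bandwidth must be taken from the datasheet of the GPU model in use.

```python
def bandwidth_utilization_percent(current_gb_s, peak_gb_s):
    """Return memory bandwidth utilization as a percentage of the peak."""
    if peak_gb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * current_gb_s / peak_gb_s


def estimated_bandwidth_gb_s(utilization_percent, peak_gb_s):
    """Inverse direction: estimate the bandwidth implied by a utilization reading."""
    return peak_gb_s * utilization_percent / 100.0


# V100 figures from the table: 450 GB/s against a 900 GB/s peak.
print(bandwidth_utilization_percent(450, 900))
print(estimated_bandwidth_gb_s(50, 900))
```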
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_FB_FREE | Gauge | MB | Amount of remaining GPU memory |
DCGM_FI_DEV_FB_USED | Gauge | MB | Amount of used GPU memory. The value is the same as the Memory-Usage value in the nvidia-smi command output. |
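The two memory metrics above can be combined into a usage percentage. The sketch below assumes that used plus free memory approximates the total; any memory reserved by the driver is not covered by these two metrics and is ignored here. The function name is hypothetical.

```python
def fb_used_percent(fb_used_mb, fb_free_mb):
    """Approximate GPU memory usage (%) from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE.

    Assumption: total memory is taken as used + free, ignoring reserved memory.
    """
    total_mb = fb_used_mb + fb_free_mb
    if total_mb <= 0:
        raise ValueError("used + free must be positive")
    return 100.0 * fb_used_mb / total_mb


# Made-up readings: 4096 MB used, 12288 MB free.
print(fb_used_percent(4096, 12288))
```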
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | Percentage of the time during which the graphics or compute engine is active within a period. This is an average value across all graphics and compute engines. An engine is active when a graphics or compute context is associated with a thread and that context is busy. |
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | Fraction of the time during which at least one thread bundle (warp) is active on an SM within a period. This is an average value of all SMs and is insensitive to the number of threads in each block. A thread bundle is active once it has been scheduled and allocated resources; it may be computing or in a non-computing state (for example, waiting for a memory request). A value below 0.5 indicates that the GPU is not used efficiently; the value should be greater than 0.8. For example, a GPU has N SMs: |
DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | Ratio of the number of thread bundles (warps) that reside on an SM to the maximum number of thread bundles that can reside on the SM. This is an average value of all SMs within a period. A high value does not necessarily mean efficient GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage. |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates a higher utilization of tensor cores. Value 1 (100%) indicates that a tensor instruction is sent every instruction cycle in the entire period (one instruction is completed in two cycles). If the value is 0.2 (20%), the possible causes are as follows: |
DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | Fraction of cycles during which the FP64 (double precision) pipe is active. This is an average value within a period, not an instantaneous value. A larger value indicates a higher usage of FP64 cores. Value 1 (100%) indicates that the FP64 instruction is executed every four cycles (for example, Volta cards) in a period. If the value is 0.2 (20%), the possible causes are as follows: |
DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | Fraction of cycles during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single precision) and integers. This is an average value within a period, not an instantaneous value. A larger value indicates a higher usage of FP32 cores. Value 1 (100%) indicates that the FP32 instruction is executed every two cycles (for example, Volta cards) in a period. If the value is 0.2 (20%), the possible causes are as follows: |
DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | Fraction of cycles during which the FP16 (half-precision) pipe is active. This is an average value within a period, not an instantaneous value. A larger value indicates a higher usage of FP16 cores. Value 1 (100%) indicates that the FP16 instruction is executed every two cycles (for example, Volta cards) in a period. If the value is 0.2 (20%), the possible causes are as follows: |
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | Fraction of cycles during which the device memory interface is sending or receiving data. This is an average value within a period, not an instantaneous value. A higher value indicates a higher utilization of device memory. Value 1 (100%) indicates that a DRAM instruction is executed in every cycle throughout the period (in practice, a peak of around 0.8 (80%) is the maximum achievable). A value of 0.2 (20%) means that 20% of the cycles in the period involve reading from or writing to device memory. |
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES | Counter | Byte/s | Rate of data transmitted or received over the PCIe bus, including the protocol header and data payload. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. Theoretically, the maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
DCGM_FI_PROF_NVLINK_RX_BYTES DCGM_FI_PROF_NVLINK_TX_BYTES | Counter | Byte/s | Rate at which data is transmitted or received through NVLink, excluding the protocol header. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. Theoretically, the maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction. |
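The thresholds and bandwidth limits quoted in the table above can be turned into simple checks. The following sketch is illustrative: the "moderate" band between 0.5 and 0.8, the default of 16 lanes, the decimal reading of MB (10^6 bytes), and all function names are assumptions, not part of the add-on or of DCGM.

```python
PCIE_GEN3_MB_S_PER_LANE = 985  # theoretical per-lane (per-channel) figure from the table


def classify_sm_active(value):
    """Classify a DCGM_FI_PROF_SM_ACTIVE reading given as a fraction in [0, 1]."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected a fraction between 0 and 1")
    if value < 0.5:
        return "inefficient"  # below 0.5: GPU not used efficiently
    if value > 0.8:
        return "good"         # above 0.8: the recommended range
    return "moderate"         # hypothetical label for the band in between


def pcie_gen3_utilization(bytes_per_second, lanes=16):
    """Fraction of theoretical PCIe Gen3 bandwidth used by a Byte/s reading.

    Assumptions: MB means 10**6 bytes and the link width is `lanes`.
    """
    peak_bytes_per_second = PCIE_GEN3_MB_S_PER_LANE * 1_000_000 * lanes
    return bytes_per_second / peak_bytes_per_second


print(classify_sm_active(0.3))
print(pcie_gen3_utilization(7_880_000_000))
```

The same pattern applies to NVLink by replacing the per-lane constant with the per-link figure for the NVLink generation in use.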
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock for the device |
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock for the device |
DCGM_FI_DEV_APP_SM_CLOCK | Gauge | MHz | SM application clocks |
DCGM_FI_DEV_APP_MEM_CLOCK | Gauge | MHz | Memory application clocks |
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS | Gauge | N/A | Bitmask of the reasons why the clock is throttled |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_XID_ERRORS | Gauge | N/A | The last XID error that occurred within a period |
DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | A violation caused by the power limit. The value is the length of time the violation lasted. |
DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | A violation caused by the thermal limit. The value is the length of time the violation lasted. |
DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | A violation caused by the synchronous boost limit. The value is the length of time the violation lasted. |
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | A violation caused by the board limit. The value is the length of time the violation lasted. |
DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | A violation caused by the low utilization limit. The value is the length of time the violation lasted. |
DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | A violation caused by the reliability limit. The value is the length of time the violation lasted. |
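Because the violation metrics are counters in microseconds, a single reading says little on its own; the change between two scrapes is more useful. The sketch below assumes that each counter accumulates the time spent in violation and that it restarts from zero after a reset. The function name and sample values are hypothetical.

```python
def violation_fraction(prev_us, curr_us, interval_seconds):
    """Fraction of the scrape interval spent in violation, from two counter readings."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta_us = curr_us - prev_us
    if delta_us < 0:
        # Assumed reset behavior: the counter restarted from zero.
        delta_us = curr_us
    fraction = delta_us / (interval_seconds * 1_000_000)
    return min(fraction, 1.0)  # clamp against scrape timing jitter


# Made-up readings taken 30 seconds apart: 3 s of throttling in the interval.
print(violation_fraction(1_000_000, 4_000_000, 30))
```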
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_BAR1_USED | Gauge | MB | Used BAR1 memory |
DCGM_FI_DEV_BAR1_FREE | Gauge | MB | Free BAR1 memory |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Memory temperature |
DCGM_FI_DEV_GPU_TEMP | Gauge | °C | GPU temperature |
DCGM_FI_DEV_POWER_USAGE | Gauge | Watt | GPU power |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | Millijoule | Energy consumed since the driver was loaded |
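The energy counter can be cross-checked against DCGM_FI_DEV_POWER_USAGE, because energy divided by time gives average power. This is a minimal sketch with a hypothetical function name and made-up readings; it treats a decreasing counter (for example, after a driver reload) as an error rather than guessing the lost energy.

```python
def average_power_watts(prev_mj, curr_mj, interval_seconds):
    """Average power (W) between two DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION readings."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta_mj = curr_mj - prev_mj
    if delta_mj < 0:
        raise ValueError("counter decreased; it was probably reset")
    joules = delta_mj / 1000.0        # millijoules to joules
    return joules / interval_seconds  # joules per second = watts


# Made-up readings 15 seconds apart: 3,000,000 mJ consumed.
print(average_power_watts(1_000_000, 4_000_000, 15))
```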
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_RETIRED_SBE | Gauge | N/A | Number of pages retired due to single-bit errors |
DCGM_FI_DEV_RETIRED_DBE | Gauge | N/A | Number of pages retired due to double-bit errors |
For details about other DCGM metrics, see Field Identifiers.