GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install version 2.7.32 or later of the add-on. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).

GPU Metrics Provided by CCE

Table 1 Basic GPU monitoring metrics
Category	Metric	Type	Unit	Monitoring Level	Description
Utilization	cce_gpu_utilization	Gauge	%	GPU cards	GPU compute usage
	cce_gpu_memory_utilization	Gauge	%	GPU cards	GPU memory usage
	cce_gpu_encoder_utilization	Gauge	%	GPU cards	GPU encoding usage
	cce_gpu_decoder_utilization	Gauge	%	GPU cards	GPU decoding usage
	cce_gpu_utilization_process	Gauge	%	GPU processes	GPU compute usage of each process
	cce_gpu_memory_utilization_process	Gauge	%	GPU processes	GPU memory usage of each process
	cce_gpu_encoder_utilization_process	Gauge	%	GPU processes	GPU encoding usage of each process
	cce_gpu_decoder_utilization_process	Gauge	%	GPU processes	GPU decoding usage of each process
Memory	cce_gpu_memory_used	Gauge	Byte	GPU cards	Used GPU memory
	cce_gpu_memory_total	Gauge	Byte	GPU cards	Total GPU memory
	cce_gpu_memory_free	Gauge	Byte	GPU cards	Free GPU memory
	cce_gpu_bar1_memory_used	Gauge	Byte	GPU cards	Used GPU BAR1 memory
	cce_gpu_bar1_memory_total	Gauge	Byte	GPU cards	Total GPU BAR1 memory
Frequency	cce_gpu_clock	Gauge	MHz	GPU cards	GPU clock frequency
	cce_gpu_memory_clock	Gauge	MHz	GPU cards	GPU memory clock frequency
	cce_gpu_graphics_clock	Gauge	MHz	GPU cards	GPU graphics clock frequency
	cce_gpu_video_clock	Gauge	MHz	GPU cards	GPU video processor frequency
Physical status	cce_gpu_temperature	Gauge	°C	GPU cards	GPU temperature
	cce_gpu_power_usage	Gauge	Milliwatt	GPU cards	GPU power
	cce_gpu_total_energy_consumption	Gauge	Millijoule	GPU cards	Total GPU energy consumption
Bandwidth	cce_gpu_pcie_link_bandwidth	Gauge	bit	GPU cards	GPU PCIe bandwidth
	cce_gpu_nvlink_bandwidth	Gauge	Gbit/s	GPU cards	GPU NVLink bandwidth
	cce_gpu_pcie_throughput_rx	Gauge	KB/s	GPU cards	GPU PCIe RX bandwidth
	cce_gpu_pcie_throughput_tx	Gauge	KB/s	GPU cards	GPU PCIe TX bandwidth
	cce_gpu_nvlink_utilization_counter_rx	Gauge	KB/s	GPU cards	GPU NVLink RX bandwidth
	cce_gpu_nvlink_utilization_counter_tx	Gauge	KB/s	GPU cards	GPU NVLink TX bandwidth
Retired memory pages	cce_gpu_retired_pages_sbe	Gauge	N/A	GPU cards	Number of retired (isolated) GPU memory pages with single-bit errors
	cce_gpu_retired_pages_dbe	Gauge	N/A	GPU cards	Number of retired (isolated) GPU memory pages with double-bit errors
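
The metrics in Table 1 use raw units (bytes, milliwatts, millijoules), so dashboards typically convert them before display. The following is a minimal Python sketch of such conversions. The function names and sample values are illustrative assumptions and are not part of the add-on.

```python
# Illustrative helpers for the raw units in Table 1 (all names are hypothetical).


def memory_used_percent(used_bytes: float, total_bytes: float) -> float:
    """Used GPU memory as a percentage, computed from samples of
    cce_gpu_memory_used and cce_gpu_memory_total (both in bytes)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def milliwatts_to_watts(milliwatts: float) -> float:
    """Convert a cce_gpu_power_usage sample (milliwatts) to watts."""
    return milliwatts / 1000.0


def bytes_to_gib(num_bytes: float) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Made-up sample values: 8 GiB used out of 32 GiB, 250,000 mW power draw.
    used, total = 8 * 2**30, 32 * 2**30
    print(memory_used_percent(used, total))
    print(bytes_to_gib(used))
    print(milliwatts_to_watts(250_000))
```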

GPU Metrics Provided by DCGM

Table 2 Utilization
Metric	Type	Unit	Description
DCGM_FI_DEV_GPU_UTIL	Gauge	%	GPU utilization: the percentage of time within a sampling period (1 s or 1/6 s, depending on the GPU model) during which one or more kernel functions are active. This metric shows only that the GPU is being used by kernel functions, not how heavily it is used.
DCGM_FI_DEV_MEM_COPY_UTIL	Gauge	%	GPU memory bandwidth utilization. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%.
DCGM_FI_DEV_ENC_UTIL	Gauge	%	GPU encoder utilization
DCGM_FI_DEV_DEC_UTIL	Gauge	%	GPU decoder utilization

Table 3 Memory
Metric	Type	Unit	Description
DCGM_FI_DEV_FB_FREE	Gauge	MB	Amount of free GPU memory (frame buffer)
DCGM_FI_DEV_FB_USED	Gauge	MB	Amount of used GPU memory (frame buffer). The value matches the Memory-Usage value in the output of the nvidia-smi command.

Table 4 Profiling
Metric	Type	Unit	Description
DCGM_FI_PROF_GR_ENGINE_ACTIVE	Gauge	%	Percentage of time within a period during which the graphics or compute engine is active, averaged over all graphics and compute engines. An engine is active when a graphics or compute context is bound to it and that context is busy.
DCGM_FI_PROF_SM_ACTIVE	Gauge	%	Fraction of time within a period during which at least one warp (thread bundle) is active on an SM, averaged over all SMs. The value is insensitive to the number of threads in each block. A warp is active once it has been scheduled and allocated resources; it may be computing or in a non-computing state (for example, waiting for a memory request). A value below 0.5 indicates that the GPU is not used efficiently; the value should be greater than 0.8. For example, on a GPU with N SMs: (1) a kernel function that runs N thread blocks on all SMs for the entire period gives a value of 1 (100%); (2) a kernel function that runs N/5 thread blocks for the entire period gives a value of 0.2; (3) a kernel function that runs N thread blocks for only 1/5 of the cycles in the period gives a value of 0.2.
DCGM_FI_PROF_SM_OCCUPANCY	Gauge	%	Ratio of the number of warps (thread bundles) resident on an SM to the maximum number of warps that can reside on the SM, averaged over all SMs within a period. A high value does not necessarily mean high GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	Gauge	%	Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of tensor cores. A value of 1 (100%) means that a tensor instruction is issued every other cycle throughout the period (one instruction completes in two cycles). A value of 0.2 (20%) can result from any of the following: 20% of the SM tensor cores run at 100% utilization for the entire period; all SM tensor cores run at 20% utilization for the entire period; all SM tensor cores run at 100% utilization for 1/5 of the period; or other combinations.
DCGM_FI_PROF_PIPE_FP64_ACTIVE	Gauge	%	Fraction of cycles during which the FP64 (double-precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP64 cores. A value of 1 (100%) means that an FP64 instruction is executed every four cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP64 cores run at 100% utilization for the entire period; all SM FP64 cores run at 20% utilization for the entire period; all SM FP64 cores run at 100% utilization for 1/5 of the period; or other combinations.
DCGM_FI_PROF_PIPE_FP32_ACTIVE	Gauge	%	Fraction of cycles during which the fused multiply-add (FMA) pipe is active. Multiply-add applies to FP32 (single-precision) and integer operations. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP32 cores. A value of 1 (100%) means that an FP32 instruction is executed every two cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP32 cores run at 100% utilization for the entire period; all SM FP32 cores run at 20% utilization for the entire period; all SM FP32 cores run at 100% utilization for 1/5 of the period; or other combinations.
DCGM_FI_PROF_PIPE_FP16_ACTIVE	Gauge	%	Fraction of cycles during which the FP16 (half-precision) pipe is active. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of FP16 cores. A value of 1 (100%) means that an FP16 instruction is executed every two cycles (for example, on Volta cards) throughout the period. A value of 0.2 (20%) can result from any of the following: 20% of the SM FP16 cores run at 100% utilization for the entire period; all SM FP16 cores run at 20% utilization for the entire period; all SM FP16 cores run at 100% utilization for 1/5 of the period; or other combinations.
DCGM_FI_PROF_DRAM_ACTIVE	Gauge	%	Fraction of cycles during which the device memory interface is sending or receiving data. This is an average value within a period, not an instantaneous value. A higher value indicates higher utilization of device memory. A value of 1 (100%) means that a DRAM instruction is executed in every cycle throughout the period, although in practice a peak of about 0.8 (80%) is the maximum achievable. A value of 0.2 (20%) means that 20% of the cycles in the period involve reading from or writing to device memory.
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES	Counter	Byte/s	Rate of data transmitted or received over the PCIe bus, including both the protocol header and the data payload. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.
DCGM_FI_PROF_NVLINK_RX_BYTES DCGM_FI_PROF_NVLINK_TX_BYTES	Counter	Byte/s	Rate of data transmitted or received over NVLink, excluding the protocol header. This is an average value within a period, not an instantaneous value. For example, if 1 GB of data is transmitted within 1 second, the transmission rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
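
The three examples given for DCGM_FI_PROF_SM_ACTIVE, and the possible causes listed for the pipe metrics, follow from each value being an average over all SMs and over the period. The following is a minimal Python sketch of that averaging. It assumes a hypothetical GPU with N = 80 SMs, and the helper name is made up for illustration.

```python
def average_active_fraction(per_sm_active_time: list[float]) -> float:
    """Average, over all SMs, of the fraction of the period each SM was active.

    Each entry is between 0.0 (never active) and 1.0 (active the whole period).
    """
    if not per_sm_active_time:
        raise ValueError("need at least one SM")
    return sum(per_sm_active_time) / len(per_sm_active_time)


N = 80  # hypothetical number of SMs

scenarios = {
    # N thread blocks keep every SM busy for the whole period.
    "all SMs, whole period": [1.0] * N,
    # N/5 thread blocks keep one fifth of the SMs busy for the whole period.
    "1/5 of SMs, whole period": [1.0] * (N // 5) + [0.0] * (N - N // 5),
    # N thread blocks run on every SM, but only for 1/5 of the cycles.
    "all SMs, 1/5 of period": [0.2] * N,
}

if __name__ == "__main__":
    for name, sample in scenarios.items():
        print(f"{name}: {average_active_fraction(sample):.2f}")
```

The last two scenarios produce the same value, which is why a reading of 0.2 alone cannot distinguish between a few fully busy SMs and many lightly used ones.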

Table 5 Frequency (clock)
Metric	Type	Unit	Description
DCGM_FI_DEV_SM_CLOCK	Gauge	MHz	SM clock for the device
DCGM_FI_DEV_MEM_CLOCK	Gauge	MHz	Memory clock for the device
DCGM_FI_DEV_APP_SM_CLOCK	Gauge	MHz	SM application clocks
DCGM_FI_DEV_APP_MEM_CLOCK	Gauge	MHz	Memory application clocks
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS	Gauge	N/A	Bitmask of the reasons why the clock is currently throttled
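
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS encodes the throttle reasons as a bitmask, so a raw sample needs decoding before it is readable. The following is a minimal Python sketch. The bit values are assumed to match the nvmlClocksThrottleReason* constants in NVML's nvml.h and should be verified against the header for your driver version. The short reason names and the function name are illustrative.

```python
# Assumed bit values, modeled on NVML's nvmlClocksThrottleReason* constants.
# Verify against nvml.h for your driver version before relying on them.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all reasons whose bit is set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


if __name__ == "__main__":
    # Made-up sample: bits 0x004 and 0x040 set.
    print(decode_throttle_reasons(0x044))
```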

Table 6 XID errors and violations
Metric	Type	Unit	Description
DCGM_FI_DEV_XID_ERRORS	Gauge	N/A	Code of the most recent XID error within a period
DCGM_FI_DEV_POWER_VIOLATION	Counter	μs	A violation caused by the power limit. The value is the accumulated duration of the violation.
DCGM_FI_DEV_THERMAL_VIOLATION	Counter	μs	A violation caused by the thermal limit. The value is the accumulated duration of the violation.
DCGM_FI_DEV_SYNC_BOOST_VIOLATION	Counter	μs	A violation caused by the synchronous boost limit. The value is the accumulated duration of the violation.
DCGM_FI_DEV_BOARD_LIMIT_VIOLATION	Counter	μs	A violation caused by the board limit. The value is the accumulated duration of the violation.
DCGM_FI_DEV_LOW_UTIL_VIOLATION	Counter	μs	A violation caused by the low utilization limit. The value is the accumulated duration of the violation.
DCGM_FI_DEV_RELIABILITY_VIOLATION	Counter	μs	A violation caused by the reliability limit. The value is the accumulated duration of the violation.
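
Because the violation metrics are counters in microseconds, the difference between two samples is more informative than a single reading. The following minimal Python sketch estimates the share of a scrape interval spent in violation. It assumes that the counters accumulate violation time; the function name and sample values are illustrative.

```python
def violation_fraction(prev_us: float, curr_us: float, interval_s: float) -> float:
    """Estimate the fraction of the interval spent in violation.

    prev_us and curr_us are two consecutive samples of a *_VIOLATION counter
    (microseconds); interval_s is the time between the samples in seconds.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_us = curr_us - prev_us
    if delta_us < 0:
        raise ValueError("counter decreased; it was probably reset")
    # Clamp to 1.0 in case of timing jitter between samples.
    return min(delta_us / (interval_s * 1_000_000), 1.0)


if __name__ == "__main__":
    # Made-up samples: 3 s of additional violation time within a 30 s interval.
    print(violation_fraction(1_000_000, 4_000_000, 30))
```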

Table 7 BAR1
Metric	Type	Unit	Description
DCGM_FI_DEV_BAR1_USED	Gauge	MB	Used BAR1 memory
DCGM_FI_DEV_BAR1_FREE	Gauge	MB	Free BAR1 memory

Table 8 Temperature and power
Metric	Type	Unit	Description
DCGM_FI_DEV_MEMORY_TEMP	Gauge	°C	Memory temperature
DCGM_FI_DEV_GPU_TEMP	Gauge	°C	GPU temperature
DCGM_FI_DEV_POWER_USAGE	Gauge	Watt	GPU power
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	Counter	Millijoule	Total energy consumed since the driver was loaded
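
Since DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules, the average power over an interval can be derived from two samples and compared with DCGM_FI_DEV_POWER_USAGE. The following is a minimal Python sketch under that assumption. The function name and sample values are illustrative, and a decreasing counter is treated as a reset (for example, after a driver reload).

```python
from typing import Optional


def average_power_watts(
    prev_mj: float, curr_mj: float, interval_s: float
) -> Optional[float]:
    """Average power in watts between two energy samples (millijoules).

    Returns None if the counter decreased, which suggests that it was reset.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr_mj < prev_mj:
        return None
    joules = (curr_mj - prev_mj) / 1000.0  # millijoules to joules
    return joules / interval_s  # joules per second = watts


if __name__ == "__main__":
    # Made-up samples: 3,000,000 mJ consumed over 15 s.
    print(average_power_watts(1_000_000, 4_000_000, 15))
```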

Table 9 Retired pages
Metric	Type	Unit	Description
DCGM_FI_DEV_RETIRED_SBE	Gauge	N/A	Number of pages retired due to single-bit errors
DCGM_FI_DEV_RETIRED_DBE	Gauge	N/A	Number of pages retired due to double-bit errors

For details about more DCGM metrics, see Field Identifiers.
