GPU Metrics

The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics and integrates DCGM-Exporter for additional GPU observability. To use DCGM-Exporter, install version 2.7.32 or later of the add-on. This section describes the metrics provided by CCE AI Suite (NVIDIA GPU).

GPU Metrics Provided by CCE

Table 1 Basic GPU monitoring metrics

Category

Metric

Type

Unit

Monitoring Level

Description

Utilization

cce_gpu_utilization

Gauge

%

GPU cards

GPU compute usage

cce_gpu_memory_utilization

Gauge

%

GPU cards

GPU memory usage

cce_gpu_encoder_utilization

Gauge

%

GPU cards

GPU encoding usage

cce_gpu_decoder_utilization

Gauge

%

GPU cards

GPU decoding usage

cce_gpu_utilization_process

Gauge

%

GPU processes

GPU compute usage of each process

cce_gpu_memory_utilization_process

Gauge

%

GPU processes

GPU memory usage of each process

cce_gpu_encoder_utilization_process

Gauge

%

GPU processes

GPU encoding usage of each process

cce_gpu_decoder_utilization_process

Gauge

%

GPU processes

GPU decoding usage of each process

Memory

cce_gpu_memory_used

Gauge

Byte

GPU cards

Used GPU memory

cce_gpu_memory_total

Gauge

Byte

GPU cards

Total GPU memory

cce_gpu_memory_free

Gauge

Byte

GPU cards

Free GPU memory

cce_gpu_bar1_memory_used

Gauge

Byte

GPU cards

Used GPU BAR1 memory

cce_gpu_bar1_memory_total

Gauge

Byte

GPU cards

Total GPU BAR1 memory

Frequency

cce_gpu_clock

Gauge

MHz

GPU cards

GPU clock frequency

cce_gpu_memory_clock

Gauge

MHz

GPU cards

GPU memory clock frequency

cce_gpu_graphics_clock

Gauge

MHz

GPU cards

GPU graphics clock frequency

cce_gpu_video_clock

Gauge

MHz

GPU cards

GPU video processor frequency

Physical status

cce_gpu_temperature

Gauge

°C

GPU cards

GPU temperature

cce_gpu_power_usage

Gauge

Milliwatt

GPU cards

GPU power

cce_gpu_total_energy_consumption

Gauge

Millijoule

GPU cards

Total GPU energy consumption

Bandwidth

cce_gpu_pcie_link_bandwidth

Gauge

bit/s

GPU cards

GPU PCIe bandwidth

cce_gpu_nvlink_bandwidth

Gauge

Gbit/s

GPU cards

GPU NVLink bandwidth

cce_gpu_pcie_throughput_rx

Gauge

KB/s

GPU cards

GPU PCIe RX bandwidth

cce_gpu_pcie_throughput_tx

Gauge

KB/s

GPU cards

GPU PCIe TX bandwidth

cce_gpu_nvlink_utilization_counter_rx

Gauge

KB/s

GPU cards

GPU NVLink RX bandwidth

cce_gpu_nvlink_utilization_counter_tx

Gauge

KB/s

GPU cards

GPU NVLink TX bandwidth

Retired memory pages

cce_gpu_retired_pages_sbe

Gauge

N/A

GPU cards

Number of GPU memory pages retired due to single-bit errors

cce_gpu_retired_pages_dbe

Gauge

N/A

GPU cards

Number of GPU memory pages retired due to double-bit errors
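
The byte-valued memory metrics and the milliwatt power metric in Table 1 often need converting before they are displayed. The following Python sketch shows only the arithmetic. The function names and sample values are hypothetical, and the sketch does not describe how the add-on itself computes cce_gpu_memory_utilization.

```python
def memory_used_percent(used_bytes: float, total_bytes: float) -> float:
    """Share of GPU memory in use, as a percentage (0-100)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def milliwatts_to_watts(milliwatts: float) -> float:
    """Convert a cce_gpu_power_usage reading from milliwatts to watts."""
    return milliwatts / 1000.0


# Hypothetical readings of cce_gpu_memory_used, cce_gpu_memory_total,
# and cce_gpu_power_usage for one GPU card.
used = 8 * 1024**3     # 8 GiB expressed in bytes
total = 32 * 1024**3   # 32 GiB expressed in bytes
print(memory_used_percent(used, total))  # 25.0
print(milliwatts_to_watts(215_000))      # 215.0
```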

GPU Metrics Provided by DCGM

Table 2 Utilization

Metric

Type

Unit

Description

DCGM_FI_DEV_GPU_UTIL

Gauge

%

GPU utilization: the percentage of time in the sample period (1s or 1/6s, depending on the GPU model) during which one or more kernel functions were running on the GPU.

This metric shows only that a kernel function was occupying the GPU. It does not show how much of the GPU's compute capacity was in use.

DCGM_FI_DEV_MEM_COPY_UTIL

Gauge

%

GPU memory bandwidth utilization of a measured object

For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth utilization is 50%.

DCGM_FI_DEV_ENC_UTIL

Gauge

%

GPU encoder utilization of a measured object

DCGM_FI_DEV_DEC_UTIL

Gauge

%

GPU decoder utilization of a measured object
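
The V100 example for DCGM_FI_DEV_MEM_COPY_UTIL can also be read in reverse: given the reported percentage and the card's peak memory bandwidth, you can estimate the bandwidth currently in use. The following is a minimal Python sketch. The function name is hypothetical, and the peak bandwidth must come from the specification of your GPU model.

```python
def estimated_bandwidth_gb_s(util_percent: float, peak_gb_s: float) -> float:
    """Estimate the memory bandwidth in use from a utilization percentage."""
    if not 0 <= util_percent <= 100:
        raise ValueError("util_percent must be between 0 and 100")
    return peak_gb_s * util_percent / 100.0


# A DCGM_FI_DEV_MEM_COPY_UTIL value of 50 on a card with a
# 900 GB/s peak (the V100 figure used in the table).
print(estimated_bandwidth_gb_s(50, 900))  # 450.0
```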

Table 3 Memory

Metric

Type

Unit

Description

DCGM_FI_DEV_FB_FREE

Gauge

MB

Amount of free GPU memory

DCGM_FI_DEV_FB_USED

Gauge

MB

Amount of used GPU memory

The value matches the Memory-Usage value in the output of the nvidia-smi command.

Table 4 Profiling

Metric

Type

Unit

Description

DCGM_FI_PROF_GR_ENGINE_ACTIVE

Gauge

%

Fraction of time in a period during which the graphics or compute engine is active.

This is an average value across all graphics or compute engines.

The graphics or compute engine is active when a graphics or compute context is bound and the graphics or compute pipe is busy.

DCGM_FI_PROF_SM_ACTIVE

Gauge

%

Fraction of time in a period during which at least one warp is active on an SM.

This is an average value across all SMs and is insensitive to the number of threads in each block.

A warp is active once it has been scheduled and allocated resources. An active warp may be computing or may be stalled (for example, waiting for a memory request).

A value below 0.5 indicates that the GPU is not being used efficiently. Aim for a value above 0.8.

For example, assume that a GPU has N SMs:

  • A kernel function uses N thread blocks and runs on all SMs for the entire period. In this case, the value is 1 (100%).
  • A kernel function uses N/5 thread blocks and runs for the entire period. In this case, the value is 0.2 (20%).
  • A kernel function uses N thread blocks but runs for only 1/5 of the period. In this case, the value is also 0.2 (20%).

DCGM_FI_PROF_SM_OCCUPANCY

Gauge

%

Ratio of the number of warps resident on an SM to the maximum number of warps that the SM can hold, measured over a period.

This is an average value across all SMs within the period.

A higher value does not necessarily mean more efficient GPU usage. Only for workloads limited by GPU memory bandwidth (see DCGM_FI_PROF_DRAM_ACTIVE) does a higher value indicate more efficient GPU usage.

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE

Gauge

%

Fraction of cycles during which the tensor (HMMA/IMMA) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of tensor cores.

A value of 1 (100%) is equivalent to issuing a tensor instruction every other cycle for the entire period.

A value of 0.2 (20%) can result from any of the following:

  • The tensor cores of 20% of the SMs run at 100% utilization for the entire period.
  • The tensor cores of all SMs run at 20% utilization for the entire period.
  • The tensor cores of all SMs run at 100% utilization for 1/5 of the period.
  • Other combinations

DCGM_FI_PROF_PIPE_FP64_ACTIVE

Gauge

%

Fraction of cycles during which the FP64 (double precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of FP64 cores.

A value of 1 (100%) is equivalent to issuing an FP64 instruction on every SM every fourth cycle (on Volta cards, for example) for the entire period.

A value of 0.2 (20%) can result from any of the following:

  • The FP64 cores of 20% of the SMs run at 100% utilization for the entire period.
  • The FP64 cores of all SMs run at 20% utilization for the entire period.
  • The FP64 cores of all SMs run at 100% utilization for 1/5 of the period.
  • Other combinations

DCGM_FI_PROF_PIPE_FP32_ACTIVE

Gauge

%

Fraction of cycles during which the fused multiply-add (FMA) pipe is active. FMA operations cover FP32 (single precision) and integer arithmetic.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of FP32 cores.

A value of 1 (100%) is equivalent to issuing an FP32 instruction on every SM every other cycle (on Volta cards, for example) for the entire period.

A value of 0.2 (20%) can result from any of the following:

  • The FP32 cores of 20% of the SMs run at 100% utilization for the entire period.
  • The FP32 cores of all SMs run at 20% utilization for the entire period.
  • The FP32 cores of all SMs run at 100% utilization for 1/5 of the period.
  • Other combinations

DCGM_FI_PROF_PIPE_FP16_ACTIVE

Gauge

%

Fraction of cycles during which the FP16 (half-precision) pipe is active.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of FP16 cores.

A value of 1 (100%) is equivalent to issuing an FP16 instruction on every SM every other cycle (on Volta cards, for example) for the entire period.

A value of 0.2 (20%) can result from any of the following:

  • The FP16 cores of 20% of the SMs run at 100% utilization for the entire period.
  • The FP16 cores of all SMs run at 20% utilization for the entire period.
  • The FP16 cores of all SMs run at 100% utilization for 1/5 of the period.
  • Other combinations

DCGM_FI_PROF_DRAM_ACTIVE

Gauge

%

Fraction of cycles during which the device memory interface is actively sending or receiving data.

This is an average value within a period, not an instantaneous value.

A higher value indicates a higher utilization of device memory.

A value of 1 (100%) would mean that a DRAM instruction is executed in every cycle throughout the period. In practice, the achievable peak is about 0.8 (80%).

A value of 0.2 (20%) means that device memory is being read or written during 20% of the cycles in the period.

DCGM_FI_PROF_PCIE_TX_BYTES

DCGM_FI_PROF_PCIE_RX_BYTES

Counter

Byte/s

Rate of data transmitted or received over the PCIe bus, including the protocol header and data payload.

This is an average value within a period, not an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane.

DCGM_FI_PROF_NVLINK_RX_BYTES

DCGM_FI_PROF_NVLINK_TX_BYTES

Counter

Byte/s

Rate at which data is transmitted or received through NVLink, excluding the protocol header.

This is an average value within a period, not an instantaneous value.

For example, if 1 GB of data is transmitted within 1 second, the rate is 1 GB/s regardless of whether the data is transmitted at a constant rate or in bursts. The theoretical maximum NVLink Gen2 bandwidth is 25 GB/s per link in each direction.
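
Two points in Table 4 are easier to see with numbers: the activity metrics are averages over both SMs and time, so different workloads can produce the same value, and the per-lane PCIe figure scales with link width. The following Python sketch models a period as a few equal time slices. The helper names, the choice of 10 SMs and 5 slices, and the x16 link width are illustrative assumptions, not values reported by DCGM.

```python
from statistics import mean
from typing import Sequence


def average_activity(per_sm_activity: Sequence[Sequence[float]]) -> float:
    """Average activity across all SMs and all time slices.

    per_sm_activity[s][t] is the fraction (0.0-1.0) of slice t
    during which SM s was active.
    """
    return mean(mean(slices) for slices in per_sm_activity)


N_SMS, N_SLICES = 10, 5  # hypothetical GPU size and sampling granularity

# N thread blocks running on all SMs for the entire period.
all_sms_whole_period = [[1.0] * N_SLICES for _ in range(N_SMS)]

# N/5 thread blocks: only 2 of the 10 SMs are busy, for the entire period.
fifth_of_sms = [[1.0] * N_SLICES if sm < N_SMS // 5 else [0.0] * N_SLICES
                for sm in range(N_SMS)]

# N thread blocks, but busy for only 1 of the 5 slices.
fifth_of_time = [[1.0] + [0.0] * (N_SLICES - 1) for _ in range(N_SMS)]

for name, scenario in [("all SMs, whole period", all_sms_whole_period),
                       ("1/5 of SMs, whole period", fifth_of_sms),
                       ("all SMs, 1/5 of period", fifth_of_time)]:
    print(name, round(average_activity(scenario), 3))


def pcie_gen3_peak_mb_s(lanes: int) -> int:
    """Theoretical PCIe Gen3 peak for a link, at 985 MB/s per lane."""
    return 985 * lanes


print(pcie_gen3_peak_mb_s(16))  # 15760 MB/s for a x16 link
```

The last two scenarios both average to 0.2, which is why a single activity value cannot tell you whether a workload is narrow (few SMs) or bursty (short kernels).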

Table 5 Frequency (clock)

Metric

Type

Unit

Description

DCGM_FI_DEV_SM_CLOCK

Gauge

MHz

SM clock for the device

DCGM_FI_DEV_MEM_CLOCK

Gauge

MHz

Memory clock for the device

DCGM_FI_DEV_APP_SM_CLOCK

Gauge

MHz

SM application clocks

DCGM_FI_DEV_APP_MEM_CLOCK

Gauge

MHz

Memory application clocks

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS

Gauge

N/A

Bitmask of the reasons why the clock is currently throttled
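
DCGM reports DCGM_FI_DEV_CLOCK_THROTTLE_REASONS as an integer bitmask rather than a frequency, so it needs decoding before it is readable. The following Python sketch shows one way to do that. The bit values and labels are an assumption based on the NVML clock throttle reason constants; verify them against the headers of the DCGM or NVML version you run before relying on them.

```python
from typing import List

# Assumed bit assignments (taken from the NVML throttle-reason constants;
# confirm against your DCGM/NVML version).
THROTTLE_REASONS = {
    0x001: "GPU idle",
    0x002: "Application clocks setting",
    0x004: "Software power cap",
    0x008: "Hardware slowdown",
    0x010: "Sync boost",
    0x020: "Software thermal slowdown",
    0x040: "Hardware thermal slowdown",
    0x080: "Hardware power brake slowdown",
    0x100: "Display clock setting",
}


def decode_throttle_reasons(mask: int) -> List[str]:
    """Return the labels of all throttle-reason bits set in mask."""
    return [label for bit, label in THROTTLE_REASONS.items() if mask & bit]


# Hypothetical metric value with bits 0x004 and 0x040 set.
print(decode_throttle_reasons(0x44))
# ['Software power cap', 'Hardware thermal slowdown']
```

A value of 0 decodes to an empty list, meaning that no throttle reason is currently reported.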

Table 6 XID errors and violations

Metric

Type

Unit

Description

DCGM_FI_DEV_XID_ERRORS

Gauge

N/A

Code of the most recent XID error that occurred

DCGM_FI_DEV_POWER_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of the power limit

DCGM_FI_DEV_THERMAL_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of the thermal limit

DCGM_FI_DEV_SYNC_BOOST_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of the synchronous boost limit

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of the board limit

DCGM_FI_DEV_LOW_UTIL_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of low utilization

DCGM_FI_DEV_RELIABILITY_VIOLATION

Counter

μs

Cumulative time, in microseconds, during which the GPU was throttled because of the reliability limit
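
The violation metrics in Table 6 are counters measured in microseconds. Assuming that each counter accumulates the time spent throttled (which is how dcgm-exporter describes these fields), the difference between two samples gives the share of the interval during which the GPU was throttled. The following Python sketch illustrates this. The function name and sample values are hypothetical.

```python
def throttled_fraction(prev_us: float, curr_us: float, interval_s: float) -> float:
    """Fraction (0.0-1.0) of the sampling interval spent throttled.

    prev_us and curr_us are two consecutive readings of a violation
    counter, in microseconds, taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_us = curr_us - prev_us
    if delta_us < 0:
        # The counter went backwards, so assume it was reset
        # and count only the time accumulated since the reset.
        delta_us = curr_us
    return min(1.0, delta_us / (interval_s * 1_000_000))


# Hypothetical DCGM_FI_DEV_POWER_VIOLATION readings taken 30 seconds apart.
print(throttled_fraction(1_000_000, 4_000_000, 30))  # 0.1
```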

Table 7 BAR1

Metric

Type

Unit

Description

DCGM_FI_DEV_BAR1_USED

Gauge

MB

Used BAR1 memory

DCGM_FI_DEV_BAR1_FREE

Gauge

MB

Free BAR1 memory

Table 8 Temperature and power

Metric

Type

Unit

Description

DCGM_FI_DEV_MEMORY_TEMP

Gauge

°C

Memory temperature

DCGM_FI_DEV_GPU_TEMP

Gauge

°C

GPU temperature

DCGM_FI_DEV_POWER_USAGE

Gauge

Watt

GPU power draw

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION

Counter

Millijoule

Total energy consumed since the driver was last loaded
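
Because DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules, the average power over an interval can be derived from two samples and compared with DCGM_FI_DEV_POWER_USAGE. The following Python sketch shows the arithmetic. The function name and sample values are hypothetical, and treating a decrease as a driver reload is an assumption.

```python
from typing import Optional


def average_power_watts(prev_mj: float, curr_mj: float,
                        interval_s: float) -> Optional[float]:
    """Average power in watts between two energy-counter samples.

    Returns None if the counter decreased, which this sketch treats
    as a reset (for example, after the driver was reloaded).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_mj = curr_mj - prev_mj
    if delta_mj < 0:
        return None
    joules = delta_mj / 1000.0   # millijoules -> joules
    return joules / interval_s   # joules per second = watts


# Hypothetical counter readings taken 30 seconds apart.
print(average_power_watts(5_000_000, 11_000_000, 30))  # 200.0
```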

Table 9 Retired pages

Metric

Type

Unit

Description

DCGM_FI_DEV_RETIRED_SBE

Gauge

N/A

Number of pages retired due to single-bit errors

DCGM_FI_DEV_RETIRED_DBE

Gauge

N/A

Number of pages retired due to double-bit errors

For details about other DCGM metrics, see Field Identifiers.