Monitoring GPU Metrics
You can use Prometheus and Grafana to observe GPU metrics. This section provides an example of using Prometheus and Grafana to monitor GPU memory usage in a cluster.
The process is as follows:
- Accessing Prometheus
(Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.
- Monitoring GPU Metrics
After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.
- Accessing Grafana
View Prometheus monitoring data on Grafana, a visualization panel.
Prerequisites
- The Cloud Native Cluster Monitoring add-on has been installed in the cluster.
- The CCE AI Suite (NVIDIA GPU) add-on has been installed in the cluster, and the add-on version is 2.0.10 or later.
Accessing Prometheus
After the Prometheus add-on is installed, you can deploy workloads and Services. The Prometheus server will be deployed as a StatefulSet in the monitoring namespace.
You can create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.
    apiVersion: v1
    kind: Service
    metadata:
      name: prom-lb    # Service name, which is customizable.
      namespace: monitoring
      labels:
        app: prometheus
        component: server
      annotations:
        kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 88    # Service port, which is customizable.
          targetPort: 9090    # Default Prometheus port. Retain the default value.
      selector:    # The label selector can be adjusted based on the label of a Prometheus server instance.
        app.kubernetes.io/name: prometheus
        prometheus: server
      type: LoadBalancer
- After the Service is created, visit <public IP address of the load balancer>:<Service port> to access Prometheus.
- Choose Status > Targets to view the targets monitored by Prometheus.
Monitoring GPU Metrics
Create a GPU workload. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page.
For more details, see GPU Metrics.
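A GPU workload for this step could look like the following minimal sketch. It is an illustration under stated assumptions, not a required configuration: the workload name, namespace, and image are hypothetical placeholders, and it assumes that the GPU add-on exposes GPUs to pods through the standard nvidia.com/gpu extended resource.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo               # Hypothetical workload name.
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      labels:
        app: gpu-demo
    spec:
      containers:
        - name: gpu-demo
          image: example.com/gpu-demo:latest   # Hypothetical image. Replace it with your own GPU application image.
          resources:
            limits:
              nvidia.com/gpu: 1                # Assumed resource name. Requests one GPU for the container.
```

Once a pod of such a workload is running on a GPU node, its GPU metrics should become available to query on the Graph page in Prometheus.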
Accessing Grafana
The Prometheus add-on comes with Grafana, an open-source visualization tool, already installed and connected to Prometheus. You can create a public network LoadBalancer Service so that you can access Grafana from the public network and view Prometheus monitoring data on Grafana.
After the Service is created, open its access address to access Grafana and select a suitable dashboard to view the aggregated content.
- Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
- Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.
    apiVersion: v1
    kind: Service
    metadata:
      name: grafana-lb    # Service name, which is customizable.
      namespace: monitoring
      labels:
        app: grafana
      annotations:
        kubernetes.io/elb.id: 038ff***    # Replace it with the ID of the public network load balancer in the VPC to which the cluster belongs.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 80    # Service port, which is customizable.
          targetPort: 3000    # Default Grafana port. Retain the default value.
      selector:
        app: grafana
      type: LoadBalancer
- After the Service is created, visit <public IP address of the load balancer>:<Service port> to access Grafana and select a proper dashboard to view virtualized GPU resources.