
Monitoring GPU Metrics

You can use Prometheus and Grafana to observe GPU metrics. This section uses GPU memory usage as an example to show how to monitor GPU metrics in a cluster with Prometheus and Grafana.

The process is as follows:

  1. Accessing Prometheus

    (Optional) Bind a LoadBalancer Service to Prometheus so that Prometheus can be accessed from external networks.

  2. Monitoring GPU Metrics

    After a GPU workload is deployed in the cluster, GPU metrics will be automatically reported.

  3. Accessing Grafana

    View Prometheus monitoring data in Grafana, a visualization tool.

Prerequisites

Accessing Prometheus

After the Prometheus add-on is installed, you can deploy workloads and Services. The Prometheus server runs as a StatefulSet in the monitoring namespace.

You can create a public network LoadBalancer Service so that Prometheus can be accessed from an external network.

  1. Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
  2. Click Create from YAML in the upper right corner to create a public network LoadBalancer Service.

    apiVersion: v1
    kind: Service
    metadata:
      name: prom-lb # Service name, which is customizable.
      namespace: monitoring
      labels:
        app: prometheus
        component: server
      annotations:
        kubernetes.io/elb.id: 038ff*** # Replace it with the ID of the public network load balancer in the VPC that the cluster belongs to.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 88 # Service port, which is customizable.
          targetPort: 9090 # Default Prometheus port. Retain the default value.
      selector: # The label selector can be adjusted based on the label of a Prometheus server instance.
        app.kubernetes.io/name: prometheus
        prometheus: server
      type: LoadBalancer

  3. After the Service is created, visit <public IP address of the load balancer>:<Service port> in a browser to access Prometheus.
  4. Choose Status > Targets to view the targets monitored by Prometheus.

Monitoring GPU Metrics

Create a GPU workload. After the workload runs properly, access Prometheus and view GPU metrics on the Graph page.

For more details, see GPU Metrics.
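
The following manifest is a minimal sketch of a GPU workload that could be used for this step. It is illustrative only: the name gpu-demo, the default namespace, and the container image are hypothetical placeholders, and the nvidia.com/gpu resource name assumes that the cluster's GPU nodes expose GPUs through the NVIDIA device plugin convention. Adjust it to match your cluster before use.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo                # Hypothetical workload name.
  namespace: default            # Assumed namespace. Change as needed.
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      labels:
        app: gpu-demo
    spec:
      containers:
        - name: gpu-demo
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # Example image. Replace it with any CUDA-enabled image available to your cluster.
          command: ["sleep", "infinity"]               # Keeps the container running so that metrics continue to be reported.
          resources:
            limits:
              nvidia.com/gpu: 1                        # Requests one GPU. Assumes the standard NVIDIA device plugin resource name.
```

Once the pod is in the Running state, GPU metrics for the workload should become queryable on the Prometheus Graph page.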

Accessing Grafana

The Prometheus add-on installs Grafana, an open-source visualization tool, and connects it to Prometheus. You can create a public network LoadBalancer Service so that you can access Grafana from the public network and view Prometheus monitoring data in Grafana. After the Service is created, open its access address and select an appropriate dashboard to view the aggregated data.

  1. Log in to the CCE console and click the name of the cluster with Prometheus installed to access the cluster console. In the navigation pane, choose Services & Ingresses.
  2. Click Create from YAML in the upper right corner to create a public network LoadBalancer Service for Grafana.

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana-lb # Service name, which is customizable.
      namespace: monitoring
      labels:
        app: grafana
      annotations:
        kubernetes.io/elb.id: 038ff*** # Replace it with the ID of the public network load balancer in the VPC to which the cluster belongs.
    spec:
      ports:
        - name: cce-service-0
          protocol: TCP
          port: 80 # Service port, which is customizable.
          targetPort: 3000 # Default Grafana port. Retain the default value.
      selector:
        app: grafana
      type: LoadBalancer

  3. After the Service is created, visit <public IP address of the load balancer>:<Service port> in a browser to access Grafana, and select an appropriate dashboard to view virtualized GPU resources.