Neutree Documentation

Kubernetes cluster monitoring shows No data

When Neutree deploys a cluster on Kubernetes, it relies on the node-exporter and dcgm-exporter components running in that Kubernetes cluster to collect and expose Prometheus-format metrics for nodes and GPUs. Neutree does not deploy these two components itself, so until they are installed, Neutree cannot retrieve or display monitoring data for nodes and GPUs, and the monitoring page shows No data.

Neutree automatically scrapes port 9100 on all nodes to collect Kubernetes cluster node metrics, and automatically scrapes port 9400 on Pods with the app=nvidia-dcgm-exporter label to collect Kubernetes cluster GPU metrics.

  • If you have already deployed node-exporter and dcgm-exporter, ensure that your components meet the metric endpoint and deployment mode requirements described in Component overview. Once the requirements are met, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.

  • If you have not deployed node-exporter and dcgm-exporter, manually install node-exporter on every node in the Kubernetes cluster, and manually install dcgm-exporter on nodes containing GPUs to collect node and GPU monitoring metrics. After the components are deployed, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.

| Component | Purpose | Metric type | Metric endpoint | Deployment mode |
| --- | --- | --- | --- | --- |
| node-exporter | Collects hardware and OS metrics for nodes. | Collects system-level monitoring data including CPU, memory, disk, and network. | 9100 | DaemonSet |
| dcgm-exporter | Collects NVIDIA GPU metrics. | Collects GPU-related monitoring data including GPU utilization, memory, temperature, and power consumption. | 9400 | DaemonSet |

node-exporter exposes system metrics for nodes and must be deployed as a DaemonSet on every node in the cluster.

Prerequisites

Ensure that the cluster in Neutree can access port 9100 of node-exporter.

Steps

  1. Replace <namespace> in the following commands with the target namespace, then run them to install node-exporter.

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install node-exporter prometheus-community/prometheus-node-exporter --namespace=<namespace>
  2. Check the node-exporter Pod status:

    kubectl get pods -n <namespace> -l app.kubernetes.io/name=prometheus-node-exporter

    All Pods should be in the Running state, indicating that node-exporter has been successfully deployed to all nodes.

  3. Verify the metric endpoint:

    # Forward the Pod's metrics port, then run curl from a second terminal.
    kubectl port-forward -n <namespace> <node-exporter-pod-name> 9100:9100
    curl http://localhost:9100/metrics

    The command should return Prometheus-format metric data, indicating that node-exporter is ready.
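
To make the check above less manual, the following sketch reads Prometheus-format text on standard input and reports whether a few metric families are present. The function name check_node_metrics is hypothetical, and the three metric names are assumptions based on families that node-exporter commonly exposes on Linux; adjust the list to the metrics you expect.

```shell
# Report which expected metric families appear in Prometheus text read from stdin.
# Returns 0 when all are present and 1 when any is missing.
# The metric names are illustrative assumptions, not a list required by Neutree.
check_node_metrics() {
    input=$(cat)
    missing=0
    for metric in node_cpu_seconds_total node_memory_MemTotal_bytes node_filesystem_size_bytes; do
        # A sample line starts with the metric name followed by '{' or a space.
        if printf '%s\n' "$input" | grep -Eq "^${metric}(\{| )"; then
            echo "ok: $metric"
        else
            echo "missing: $metric"
            missing=1
        fi
    done
    return "$missing"
}
```

With the port-forward from step 3 running, it could be used as `curl -s http://localhost:9100/metrics | check_node_metrics`.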

dcgm-exporter exposes monitoring metrics for NVIDIA GPUs and must be deployed on all nodes containing GPUs. You can use node selectors or taints and tolerations to limit the deployment scope.
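
As one possible way to limit the deployment scope with a node selector, the sketch below builds a JSON merge patch that adds a single nodeSelector entry to a DaemonSet's Pod template. The helper name node_selector_patch is hypothetical, and the label nvidia.com/gpu.present=true is an assumption (it is typically applied to GPU nodes when NVIDIA GPU Operator is installed); substitute whatever label identifies GPU nodes in your cluster. The patch applies only after the dcgm-exporter DaemonSet described in this section has been created.

```shell
# Print a JSON merge patch that adds one nodeSelector entry to a DaemonSet's Pod template.
# Arguments: label key, label value. The helper name is hypothetical.
node_selector_patch() {
    printf '{"spec":{"template":{"spec":{"nodeSelector":{"%s":"%s"}}}}}\n' "$1" "$2"
}

# Example usage (assumes GPU nodes carry the label nvidia.com/gpu.present=true):
# kubectl -n <namespace> patch daemonset dcgm-exporter --type merge -p "$(node_selector_patch nvidia.com/gpu.present true)"
```

A merge patch leaves the rest of the DaemonSet unchanged, so the manifest itself does not need to be edited and reapplied.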

Prerequisites

  • The cluster in Neutree can access the metric endpoint of dcgm-exporter (default port 9400).

  • The Kubernetes cluster has NVIDIA GPU Operator installed, or the NVIDIA graphics driver and Container Toolkit have been manually installed.

  • GPU nodes in the Kubernetes cluster are correctly recognized by Kubernetes. You can verify this by running kubectl describe node <node-name> to view GPU resources.
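
Instead of describing nodes one at a time, the GPU check above can be sketched for the whole cluster as follows. This assumes GPUs are advertised under the resource name nvidia.com/gpu, which is the name used by the NVIDIA device plugin; the function names are hypothetical.

```shell
# List every node with its allocatable nvidia.com/gpu count (requires kubectl access).
# Nodes that do not advertise the resource show <none> in the GPUS column.
list_gpu_capacity() {
    kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Read that table on stdin and print only the nodes with at least one allocatable GPU.
only_gpu_nodes() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}
```

Running `list_gpu_capacity | only_gpu_nodes` would then print the nodes on which dcgm-exporter is expected to report GPU metrics.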

Steps

  1. Replace <namespace> in the following command and run it to install dcgm-exporter.

    kubectl apply -f - << EOF
    apiVersion: v1
    data:
      dcgm-metrics.csv: |
        # Format
        # If line starts with a '#' it is considered a comment
        # DCGM FIELD, Prometheus metric type, help message
        # Clocks
        DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
        DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
        # Temperature
        DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
        DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
        # Power
        DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
        DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
        # PCIE
        # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
        # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
        DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
        # Utilization (the sample period varies depending on the product)
        DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
        DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
        DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
        # Errors and violations
        DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
        # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
        # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
        # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
        # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
        # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
        # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
        # Memory usage
        DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
        DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
        DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB).
        DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
        # ECC
        # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
        # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
        # Retired pages
        # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
        # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
        # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
        # NVLink
        # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
        # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
        # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
        # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
        DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
        # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
        # VGPU License status
        DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
        # Remapped rows
        DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
        DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
        DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
        # Static configuration information. These appear as labels on the other metrics
        DCGM_FI_DRIVER_VERSION, label, Driver Version
        # DCGM_FI_NVML_VERSION, label, NVML Version
        # DCGM_FI_DEV_BRAND, label, Device Brand
        # DCGM_FI_DEV_SERIAL, label, Device Serial Number
        # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
        # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
        # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
        # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
        # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
        # DCP metrics
        DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
        # DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
        # DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
        DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
        DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
        # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
        DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
        DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    kind: ConfigMap
    metadata:
      name: metrics-config
      namespace: <namespace>
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/version: 4.7.1
      name: dcgm-exporter
      namespace: <namespace>
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
          app.kubernetes.io/version: 4.7.1
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: nvidia-dcgm-exporter
            app.kubernetes.io/name: dcgm-exporter
            app.kubernetes.io/version: 4.7.1
          name: dcgm-exporter
        spec:
          automountServiceAccountToken: false
          containers:
          - env:
            - name: DCGM_EXPORTER_LISTEN
              value: :9400
            - name: DCGM_EXPORTER_KUBERNETES
              value: "true"
            - name: DCGM_EXPORTER_COLLECTORS
              value: /etc/dcgm-exporter/dcgm-metrics.csv
            image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
            imagePullPolicy: IfNotPresent
            name: dcgm-exporter
            ports:
            - containerPort: 9400
              name: metrics
              protocol: TCP
            resources:
              limits:
                cpu: 200m
                memory: 512Mi
              requests:
                cpu: 100m
                memory: 128Mi
            securityContext:
              allowPrivilegeEscalation: false
              capabilities:
                add:
                - SYS_ADMIN
                drop:
                - ALL
              runAsNonRoot: false
              runAsUser: 0
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /var/lib/kubelet/pod-resources
              name: pod-gpu-resources
              readOnly: true
            - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
              name: metrics-config
              readOnly: true
              subPath: dcgm-metrics.csv
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
          - hostPath:
              path: /var/lib/kubelet/pod-resources
              type: ""
            name: pod-gpu-resources
          - configMap:
              defaultMode: 420
              items:
              - key: dcgm-metrics.csv
                path: dcgm-metrics.csv
              name: metrics-config
            name: metrics-config
      updateStrategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
    EOF
  2. Check the dcgm-exporter Pod status:

    kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter

    All Pods should be in the Running state, indicating that dcgm-exporter has been successfully deployed to all nodes containing GPUs.

  3. Verify the metric endpoint:

    # Forward the Pod's metrics port, then run curl from a second terminal.
    kubectl port-forward -n <namespace> <dcgm-exporter-pod-name> 9400:9400
    curl http://localhost:9400/metrics

    The command should return data containing GPU-related metrics (such as DCGM_FI_DEV_GPU_UTIL), indicating that dcgm-exporter is ready.
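
To pick the GPU utilization samples out of the full metrics output, a small filter such as the one below may help. It is a sketch with assumptions: the function name gpu_util_summary is hypothetical, each DCGM_FI_DEV_GPU_UTIL sample is assumed to carry a gpu="<index>" label, and the sample value is assumed to be the last field on the line (no trailing timestamp).

```shell
# Read Prometheus text on stdin and print one line per DCGM_FI_DEV_GPU_UTIL sample.
# Assumes a gpu="<index>" label on each sample and the value as the last field.
gpu_util_summary() {
    awk '/^DCGM_FI_DEV_GPU_UTIL\{/ {
        gpu = "?"
        # Extract the text between the quotes of the gpu="..." label, if present.
        if (match($0, /gpu="[^"]*"/)) {
            gpu = substr($0, RSTART + 5, RLENGTH - 6)
        }
        print "gpu " gpu ": " $NF "%"
    }'
}
```

With the port-forward from step 3 running, it could be used as `curl -s http://localhost:9400/metrics | gpu_util_summary`. No output means the metric is absent, which suggests checking the Pod logs and the metrics ConfigMap.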

NVIDIA GPU Operator