
Kubernetes Cluster Monitoring Shows No Data

When the data plane is deployed on a Kubernetes cluster, Neutree relies on the cluster's node-exporter and dcgm-exporter components to collect and expose Prometheus-formatted metrics for nodes and GPUs. Neutree does not deploy these components itself, so node and GPU monitoring data cannot be collected or displayed until they are installed.

For Kubernetes clusters, Neutree automatically collects node metrics from port 9100 on every node, and GPU metrics from port 9400 on Pods labeled app=nvidia-dcgm-exporter.
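
To spot-check what Neutree will scrape, you can list the node addresses (scraped on port 9100) and the Pods carrying the app=nvidia-dcgm-exporter label (scraped on port 9400). The commands below are a minimal sketch and assume kubectl access to the same cluster:

    Terminal window
    kubectl get nodes -o wide
    kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter -o wide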

  • If you have already deployed node-exporter and dcgm-exporter, ensure they meet the metric endpoint and deployment requirements described in Component Description below. Once the requirements are met, you can view node and GPU monitoring data on the cluster details monitoring page.

  • If you have not deployed node-exporter and dcgm-exporter, manually install node-exporter on every node in your Kubernetes cluster, and manually install dcgm-exporter on nodes with GPUs to collect node and GPU monitoring metrics. After deployment, you can view node and GPU monitoring data on the cluster details monitoring page.
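
To check whether either component is already running in your cluster, a simple (if approximate) filter over the existing DaemonSets is usually enough; adjust the pattern if your deployments use different names:

    Terminal window
    kubectl get daemonsets --all-namespaces | grep -Ei 'node-exporter|dcgm'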

Component Description

| Component | Purpose | Metric Types | Metric Endpoint (Port) | Deployment Method |
| --- | --- | --- | --- | --- |
| node-exporter | Collects hardware and OS metrics from nodes. | CPU, memory, disk, network, and other system-level monitoring data. | 9100 | DaemonSet |
| dcgm-exporter | Collects NVIDIA GPU metrics. | GPU utilization, memory, temperature, power consumption, and other GPU-related monitoring data. | 9400 | DaemonSet |

Deploy node-exporter

node-exporter exposes system metrics from nodes and must be deployed as a DaemonSet on every node in the cluster.

Prerequisites

Ensure the Neutree data plane can access port 9100 of node-exporter.
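
Since the data plane runs in the same cluster, one way to verify this prerequisite (once node-exporter from the steps below is installed) is to curl a node's port 9100 from inside the cluster. The sketch below assumes the curlimages/curl image can be pulled and that <node-ip> is a node's internal IP (shown by kubectl get nodes -o wide); it should print 200:

    Terminal window
    kubectl run port-check -i --rm --restart=Never --image=curlimages/curl -- \
      curl -s -o /dev/null -w "%{http_code}\n" http://<node-ip>:9100/metrics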

Steps

  1. Replace <namespace> in the following command and run it to install node-exporter:

    Terminal window
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install node-exporter prometheus-community/prometheus-node-exporter --namespace=<namespace>
  2. Check the node-exporter Pod status:

    Terminal window
    kubectl get pods -n <namespace> -l app.kubernetes.io/name=prometheus-node-exporter

    All Pods should be in Running status, indicating node-exporter has been successfully deployed to all nodes (a quick way to cross-check coverage against the node count is shown after this list).

  3. Verify the metrics endpoint:

    Terminal window
    kubectl port-forward -n <namespace> <node-exporter-pod-name> 9100:9100
    curl http://localhost:9100/metrics

    Prometheus-formatted metrics data should be returned, indicating node-exporter is ready.
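
As mentioned in step 2, a quick way to confirm that node-exporter covers every node is to compare the number of nodes with the DESIRED and READY counts of its DaemonSet (the DaemonSet name depends on your Helm release name, so the second command simply lists all DaemonSets in the namespace):

    Terminal window
    kubectl get nodes --no-headers | wc -l
    kubectl get daemonset -n <namespace>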

Deploy dcgm-exporter

dcgm-exporter exposes NVIDIA GPU monitoring metrics and needs to be deployed on all nodes with GPUs. You can use node selectors or tolerations to limit the deployment scope.
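
For example, if your GPU nodes carry a label such as nvidia.com/gpu.present=true (commonly set by the GPU Operator's GPU Feature Discovery; substitute whatever label your nodes actually have), a merge patch like the following, applied after the DaemonSet from the steps below exists, restricts dcgm-exporter to those nodes. This is only a sketch, not part of the required installation:

    Terminal window
    kubectl -n <namespace> patch daemonset dcgm-exporter --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"nvidia.com/gpu.present":"true"}}}}}'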

Prerequisites

  • The Neutree data plane can access the dcgm-exporter metrics endpoint (default port 9400).

  • The Kubernetes cluster has the NVIDIA GPU Operator installed, or the NVIDIA GPU driver and NVIDIA Container Toolkit have been installed manually.

  • GPU nodes in the Kubernetes cluster are correctly recognized by Kubernetes. You can verify GPU resources using kubectl describe node <node-name>.
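
As a quick check for the last prerequisite, the allocatable GPU count should appear in the node's resources; any non-empty nvidia.com/gpu line under Capacity or Allocatable indicates the device plugin has registered the GPUs:

    Terminal window
    kubectl describe node <node-name> | grep -i "nvidia.com/gpu"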

Steps

  1. Replace <namespace> in the following command and run it to install dcgm-exporter:

    Terminal window
    kubectl apply -f - << EOF
    apiVersion: v1
    data:
      dcgm-metrics.csv: |
        # Format
        # If line starts with a '#' it is considered a comment
        # DCGM FIELD, Prometheus metric type, help message
        # Clocks
        DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
        DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
        # Temperature
        DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
        DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
        # Power
        DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
        DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
        # PCIE
        # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
        # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
        DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
        # Utilization (the sample period varies depending on the product)
        DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
        DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
        DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
        # Errors and violations
        DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
        # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
        # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
        # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
        # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
        # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
        # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
        # Memory usage
        DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
        DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
        DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB).
        DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
        # ECC
        # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
        # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
        # Retired pages
        # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
        # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
        # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
        # NVLink
        # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
        # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
        # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
        # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
        DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
        # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
        # VGPU License status
        DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
        # Remapped rows
        DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
        DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
        DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
        # Static configuration information. These appear as labels on the other metrics
        DCGM_FI_DRIVER_VERSION, label, Driver Version
        # DCGM_FI_NVML_VERSION, label, NVML Version
        # DCGM_FI_DEV_BRAND, label, Device Brand
        # DCGM_FI_DEV_SERIAL, label, Device Serial Number
        # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
        # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
        # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
        # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
        # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
        # DCP metrics
        DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
        # DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
        # DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
        DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
        DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
        # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
        DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
        DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    kind: ConfigMap
    metadata:
      name: metrics-config
      namespace: <namespace>
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/version: 4.7.1
      name: dcgm-exporter
      namespace: <namespace>
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
          app.kubernetes.io/version: 4.7.1
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: nvidia-dcgm-exporter
            app.kubernetes.io/name: dcgm-exporter
            app.kubernetes.io/version: 4.7.1
          name: dcgm-exporter
        spec:
          automountServiceAccountToken: false
          containers:
            - env:
                - name: DCGM_EXPORTER_LISTEN
                  value: ":9400"
                - name: DCGM_EXPORTER_KUBERNETES
                  value: "true"
                - name: DCGM_EXPORTER_COLLECTORS
                  value: /etc/dcgm-exporter/dcgm-metrics.csv
              image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
              imagePullPolicy: IfNotPresent
              name: dcgm-exporter
              ports:
                - containerPort: 9400
                  name: metrics
                  protocol: TCP
              resources:
                limits:
                  cpu: 200m
                  memory: 512Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  add:
                    - SYS_ADMIN
                  drop:
                    - ALL
                runAsNonRoot: false
                runAsUser: 0
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /var/lib/kubelet/pod-resources
                  name: pod-gpu-resources
                  readOnly: true
                - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
                  name: metrics-config
                  readOnly: true
                  subPath: dcgm-metrics.csv
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - hostPath:
                path: /var/lib/kubelet/pod-resources
                type: ""
              name: pod-gpu-resources
            - configMap:
                defaultMode: 420
                items:
                  - key: dcgm-metrics.csv
                    path: dcgm-metrics.csv
                name: metrics-config
              name: metrics-config
      updateStrategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
    EOF
  2. Check the dcgm-exporter Pod status:

    Terminal window
    kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter

    All Pods should be in Running status, indicating dcgm-exporter has been successfully deployed to all nodes with GPUs.

  3. Verify the metrics endpoint:

    Terminal window
    kubectl port-forward -n <namespace> <dcgm-exporter-pod-name> 9400:9400
    curl http://localhost:9400/metrics

    GPU-related metrics data (such as DCGM_FI_DEV_GPU_UTIL) should be returned, indicating dcgm-exporter is ready.
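
Finally, you can confirm that the Pods expose exactly what Neutree discovers, i.e. the app=nvidia-dcgm-exporter label and container port 9400; the custom-columns expression below is just one way to display this:

    Terminal window
    kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PORT:.spec.containers[0].ports[0].containerPort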
