Kubernetes cluster monitoring shows No data
When Neutree deploys a cluster on a Kubernetes cluster, it relies on the node-exporter and dcgm-exporter components of that Kubernetes cluster to collect and expose Prometheus-format metrics for nodes and GPUs. Neutree does not deploy these two components itself, so if they are missing from the Kubernetes cluster, Neutree cannot retrieve or display monitoring data for nodes and GPUs.
Solution
Neutree automatically scrapes port 9100 on all nodes to collect Kubernetes cluster node metrics, and automatically scrapes port 9400 on Pods with the app=nvidia-dcgm-exporter label to collect Kubernetes cluster GPU metrics.
- If you have already deployed `node-exporter` and `dcgm-exporter`, ensure that your components meet the metric endpoint and deployment mode requirements described in Component overview. Once the requirements are met, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.
- If you have not deployed `node-exporter` and `dcgm-exporter`, manually install node-exporter on every node in the Kubernetes cluster, and manually install dcgm-exporter on nodes containing GPUs to collect node and GPU monitoring metrics. After the components are deployed, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.
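To see which endpoints these scrape rules would cover in a given cluster, the following sketch lists them with kubectl. It assumes that nodes are reached through their InternalIP and Pods through their Pod IP, which this page does not specify. `to_targets` and `list_expected_targets` are hypothetical helper names, not part of Neutree.

```shell
# Print one metrics URL per IP address read from stdin.
# Usage: <command that prints IPs> | to_targets PORT
to_targets() {
  port="$1"
  while IFS= read -r ip; do
    # Skip blank lines so empty kubectl output yields no targets.
    [ -n "$ip" ] && printf 'http://%s:%s/metrics\n' "$ip" "$port"
  done
  return 0
}

# List the endpoints described above (requires kubectl access to the cluster).
list_expected_targets() {
  # node-exporter: port 9100 on every node (using InternalIP is an assumption).
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' \
    | to_targets 9100
  # dcgm-exporter: port 9400 on Pods labeled app=nvidia-dcgm-exporter.
  kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter \
    -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' \
    | to_targets 9400
}
```

Running `list_expected_targets` and finding no port 9400 entries would suggest that no Pod carries the `app=nvidia-dcgm-exporter` label, which is one possible cause of missing GPU data.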
Component overview
| Component | Purpose | Metric type | Metric endpoint | Deployment mode |
|---|---|---|---|---|
| node-exporter | Collects hardware and OS metrics for nodes. | System-level monitoring data, including CPU, memory, disk, and network. | Port 9100 | DaemonSet |
| dcgm-exporter | Collects NVIDIA GPU metrics. | GPU-related monitoring data, including GPU utilization, memory, temperature, and power consumption. | Port 9400 | DaemonSet |
Installing node-exporter
node-exporter exposes system metrics for nodes and must be deployed as a DaemonSet on every node in the cluster.
Prerequisites
Ensure that the cluster in Neutree can access port 9100 of node-exporter.
Steps
1. Replace `<namespace>` in the following commands and run them to install `node-exporter`.

   ```shell
   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   helm install node-exporter prometheus-community/prometheus-node-exporter --namespace=<namespace>
   ```

2. Check the node-exporter Pod status:

   ```shell
   # Label applied by the prometheus-node-exporter chart; adjust it if your chart version labels Pods differently.
   kubectl get pods -n <namespace> -l app.kubernetes.io/name=prometheus-node-exporter
   ```

   All Pods should be in the `Running` state, indicating that `node-exporter` has been successfully deployed to all nodes.

3. Verify the metric endpoint:

   ```shell
   # kubectl port-forward stays in the foreground, so run curl from a second terminal.
   kubectl port-forward -n <namespace> <node-exporter-pod-name> 9100:9100
   curl http://localhost:9100/metrics
   ```

   The command should return Prometheus-format metric data, indicating that `node-exporter` is ready.
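Rather than reading the full response by eye, the endpoint check can be narrowed to a few named metrics. The sketch below assumes the Prometheus text format returned by the `curl` command in the last step. `require_metrics` is a hypothetical helper name, and the two metric names in the usage note are ones node-exporter exposes on Linux hosts.

```shell
# Succeed only if every named metric appears as a sample line in the
# Prometheus text read from stdin; print the names that are missing.
require_metrics() {
  body=$(cat)
  missing=0
  for name in "$@"; do
    # A sample line starts with the metric name followed by '{' or a space,
    # so "# HELP" and "# TYPE" comment lines are not counted.
    if ! printf '%s\n' "$body" | grep -Eq "^${name}(\{| )"; then
      echo "missing metric: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

With the port-forward still running, a call such as `curl -s http://localhost:9100/metrics | require_metrics node_cpu_seconds_total node_memory_MemAvailable_bytes` exits with status 0 when both metrics are present and prints the missing names otherwise.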
Installing dcgm-exporter
dcgm-exporter exposes monitoring metrics for NVIDIA GPUs and must be deployed on all nodes containing GPUs. You can use node selectors or taints and tolerations to limit the deployment scope.
Prerequisites
- The cluster in Neutree can access the metric endpoint of `dcgm-exporter` (default port 9400).
- The Kubernetes cluster has NVIDIA GPU Operator installed, or the NVIDIA graphics driver and Container Toolkit have been manually installed.
- GPU nodes in the Kubernetes cluster are correctly recognized by Kubernetes. You can verify this by running `kubectl describe node <node-name>` to view GPU resources.
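To check every node at once instead of describing them one by one, a sketch along these lines may help. It assumes GPUs are advertised under the `nvidia.com/gpu` resource name, which is what the NVIDIA device plugin registers. `gpu_nodes` and `list_gpu_nodes` are hypothetical helper names.

```shell
# Read a "NAME GPU" table from stdin and print the nodes whose allocatable
# nvidia.com/gpu count is a positive integer. Nodes without GPUs show "<none>".
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Query the cluster (requires kubectl access) and filter the result.
list_gpu_nodes() {
  kubectl get nodes \
    -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
    | gpu_nodes
}
```

If `list_gpu_nodes` prints nothing on a cluster that has GPU hardware, the driver, Container Toolkit, or device plugin setup from the previous prerequisite is the likely place to look.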
Steps
1. Replace `<namespace>` in the following command and run it to install `dcgm-exporter`.

   ```shell
   kubectl apply -f - << EOF
   apiVersion: v1
   data:
     dcgm-metrics.csv: |
       # Format
       # If line starts with a '#' it is considered a comment
       # DCGM FIELD, Prometheus metric type, help message
       # Clocks
       DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
       DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
       # Temperature
       DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
       DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
       # Power
       DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
       DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
       # PCIE
       # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
       # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
       DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
       # Utilization (the sample period varies depending on the product)
       DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
       DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
       DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
       DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
       # Errors and violations
       DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
       # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
       # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
       # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
       # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
       # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
       # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
       # Memory usage
       DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
       DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
       DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB).
       DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
       # ECC
       # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
       # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
       # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
       # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
       # Retired pages
       # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
       # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
       # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
       # NVLink
       # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
       # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
       # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
       # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
       DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
       # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
       # VGPU License status
       DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
       # Remapped rows
       DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
       DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
       DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
       # Static configuration information. These appear as labels on the other metrics
       DCGM_FI_DRIVER_VERSION, label, Driver Version
       # DCGM_FI_NVML_VERSION, label, NVML Version
       # DCGM_FI_DEV_BRAND, label, Device Brand
       # DCGM_FI_DEV_SERIAL, label, Device Serial Number
       # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
       # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
       # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
       # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
       # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
       # DCP metrics
       DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
       # DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
       # DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
       DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
       DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
       # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
       # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
       # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
       DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
       DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
   kind: ConfigMap
   metadata:
     name: metrics-config
     namespace: <namespace>
   ---
   apiVersion: apps/v1
   kind: DaemonSet
   metadata:
     labels:
       app.kubernetes.io/name: dcgm-exporter
       app.kubernetes.io/version: 4.7.1
     name: dcgm-exporter
     namespace: <namespace>
   spec:
     revisionHistoryLimit: 10
     selector:
       matchLabels:
         app.kubernetes.io/name: dcgm-exporter
         app.kubernetes.io/version: 4.7.1
     template:
       metadata:
         creationTimestamp: null
         labels:
           app: nvidia-dcgm-exporter
           app.kubernetes.io/name: dcgm-exporter
           app.kubernetes.io/version: 4.7.1
         name: dcgm-exporter
       spec:
         automountServiceAccountToken: false
         containers:
         - env:
           - name: DCGM_EXPORTER_LISTEN
             value: :9400
           - name: DCGM_EXPORTER_KUBERNETES
             value: "true"
           - name: DCGM_EXPORTER_COLLECTORS
             value: /etc/dcgm-exporter/dcgm-metrics.csv
           image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
           imagePullPolicy: IfNotPresent
           name: dcgm-exporter
           ports:
           - containerPort: 9400
             name: metrics
             protocol: TCP
           resources:
             limits:
               cpu: 200m
               memory: 512Mi
             requests:
               cpu: 100m
               memory: 128Mi
           securityContext:
             allowPrivilegeEscalation: false
             capabilities:
               add:
               - SYS_ADMIN
               drop:
               - ALL
             runAsNonRoot: false
             runAsUser: 0
           terminationMessagePath: /dev/termination-log
           terminationMessagePolicy: File
           volumeMounts:
           - mountPath: /var/lib/kubelet/pod-resources
             name: pod-gpu-resources
             readOnly: true
           - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
             name: metrics-config
             readOnly: true
             subPath: dcgm-metrics.csv
         dnsPolicy: ClusterFirst
         restartPolicy: Always
         securityContext: {}
         terminationGracePeriodSeconds: 30
         volumes:
         - hostPath:
             path: /var/lib/kubelet/pod-resources
             type: ""
           name: pod-gpu-resources
         - configMap:
             defaultMode: 420
             items:
             - key: dcgm-metrics.csv
               path: dcgm-metrics.csv
             name: metrics-config
           name: metrics-config
     updateStrategy:
       rollingUpdate:
         maxSurge: 0
         maxUnavailable: 1
       type: RollingUpdate
   EOF
   ```

2. Check the dcgm-exporter Pod status:

   ```shell
   # The Pod template above sets app=nvidia-dcgm-exporter, the same label Neutree scrapes by.
   kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter
   ```

   All Pods should be in the `Running` state, indicating that `dcgm-exporter` has been successfully deployed to all nodes containing GPUs.

3. Verify the metric endpoint:

   ```shell
   # kubectl port-forward stays in the foreground, so run curl from a second terminal.
   kubectl port-forward -n <namespace> <dcgm-exporter-pod-name> 9400:9400
   curl http://localhost:9400/metrics
   ```

   The command should return data containing GPU-related metrics (such as `DCGM_FI_DEV_GPU_UTIL`), indicating that `dcgm-exporter` is ready.
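The DaemonSet in the first step carries no node selector, so it schedules a Pod on every node. One way to limit it to GPU nodes, as mentioned at the start of this section, is to patch a node selector into the Pod template. The sketch below is only an illustration: `node_selector_patch` and `limit_dcgm_to_labeled_nodes` are hypothetical helper names, and the label you pass must already exist on your GPU nodes. The `nvidia.com/gpu.present=true` label in the usage note is an assumption based on what the NVIDIA GPU Operator typically applies; check it with `kubectl get nodes --show-labels` first.

```shell
# Build a JSON merge patch that sets spec.template.spec.nodeSelector
# to a single KEY=VALUE label.
node_selector_patch() {
  printf '{"spec":{"template":{"spec":{"nodeSelector":{"%s":"%s"}}}}}\n' "$1" "$2"
}

# Apply the patch to the dcgm-exporter DaemonSet (requires kubectl access).
# Arguments: NAMESPACE KEY VALUE
limit_dcgm_to_labeled_nodes() {
  kubectl patch daemonset dcgm-exporter -n "$1" --type merge \
    -p "$(node_selector_patch "$2" "$3")"
}
```

For example, `limit_dcgm_to_labeled_nodes <namespace> nvidia.com/gpu.present true` would restrict the Pods to nodes carrying that label. Because the patch changes the Pod template, the DaemonSet rolls out replacement Pods according to its update strategy.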