Kubernetes cluster monitoring shows No data

Issue

When Neutree deploys a cluster on Kubernetes, it relies on the node-exporter and dcgm-exporter components running in that Kubernetes cluster to collect and expose Prometheus-format metrics for nodes and GPUs. Neutree does not deploy these two components itself, so if they are missing, Neutree cannot retrieve or display monitoring data for nodes and GPUs, and the monitoring page shows No data.

Solution

Neutree automatically scrapes port 9100 on all nodes to collect Kubernetes cluster node metrics, and automatically scrapes port 9400 on Pods with the app=nvidia-dcgm-exporter label to collect Kubernetes cluster GPU metrics.
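The scrape targets described above can be listed by hand to see which endpoints Neutree would try to reach. The following is a minimal sketch, not Neutree's own logic: the helper name metrics_urls is hypothetical, and it assumes IPv4 addresses and the conventional /metrics path.

```shell
# Hypothetical helper: read IP addresses (one per line) on stdin and
# print the metrics URL for the given port. Assumes IPv4 and /metrics.
metrics_urls() {
  port=$1
  while IFS= read -r ip; do
    [ -n "$ip" ] && printf 'http://%s:%s/metrics\n' "$ip" "$port"
  done
  return 0
}

# Example usage against a live cluster (not executed here):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}' | metrics_urls 9100
#   kubectl get pods -A -l app=nvidia-dcgm-exporter -o jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}' | metrics_urls 9400
```

Each printed URL can then be fetched with curl from a host that has the same network reachability as Neutree. Whether Neutree uses the node InternalIP is an assumption of this sketch.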

If you have already deployed node-exporter and dcgm-exporter, ensure that your components meet the metric endpoint and deployment mode requirements described in Component overview. Once the requirements are met, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.
If you have not deployed node-exporter and dcgm-exporter, manually install node-exporter on every node in the Kubernetes cluster, and manually install dcgm-exporter on nodes containing GPUs to collect node and GPU monitoring metrics. After the components are deployed, node and GPU monitoring data will display correctly on the cluster details monitoring page in Neutree.

Component overview

Component	Purpose	Metric type	Metric endpoint (port)	Deployment mode
node-exporter	Collects hardware and OS metrics for nodes.	System-level monitoring data, including CPU, memory, disk, and network.	9100	DaemonSet
dcgm-exporter	Collects NVIDIA GPU metrics.	GPU-related monitoring data, including GPU utilization, memory, temperature, and power consumption.	9400	DaemonSet
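For components that are already deployed, one way to check the DaemonSet requirement is to compare the DESIRED and READY counts that kubectl reports. This is a rough sketch: the helper name daemonsets_not_ready is hypothetical, and it assumes the default column layout of kubectl get daemonset for a single namespace (NAME, DESIRED, CURRENT, READY, ...).

```shell
# Hypothetical helper: read `kubectl get daemonset --no-headers` output
# on stdin and print DaemonSets whose READY count differs from DESIRED.
# Assumed columns: NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE ...
daemonsets_not_ready() {
  awk '$2 != $4 { print $1 ": desired=" $2 " ready=" $4 }'
}

# Example usage against a live cluster (not executed here):
#   kubectl get daemonset -n <namespace> --no-headers | daemonsets_not_ready
```

No output means every listed DaemonSet has as many ready Pods as it wants to schedule.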

Installing node-exporter

node-exporter exposes system metrics for nodes and must be deployed as a DaemonSet on every node in the cluster.

Prerequisites

Ensure that the cluster in Neutree can access port 9100 of node-exporter.

Steps

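Replace <namespace> in the following commands with the target namespace, then run them to install node-exporter. The namespace must already exist unless you add Helm's --create-namespace flag.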

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm install node-exporter prometheus-community/prometheus-node-exporter --namespace=<namespace>

Check the node-exporter Pod status:
```
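# Current versions of the prometheus-node-exporter chart label Pods with
# app.kubernetes.io/name; adjust the selector if your chart version or values differ.
kubectl get pods -n <namespace> -l app.kubernetes.io/name=prometheus-node-exporter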
```
All Pods should be in the Running state, indicating that node-exporter has been successfully deployed to all nodes.
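On clusters with many nodes, scanning the Pod list by eye is error-prone. A small filter such as the following could list only the Pods that are not yet Running. The helper name pods_not_running is hypothetical, and the sketch assumes the default kubectl get pods column layout for a single namespace.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on
# stdin and print the name and status of Pods that are not Running.
# Assumed columns: NAME READY STATUS RESTARTS AGE
pods_not_running() {
  awk '$3 != "Running" { print $1 " " $3 }'
}

# Example usage against a live cluster (not executed here):
#   kubectl get pods -n <namespace> -l <selector> --no-headers | pods_not_running
```

An empty result means every matched Pod reports the Running status.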
Verify the metric endpoint:
```
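# Terminal 1: port-forward keeps running until you interrupt it with Ctrl+C.
kubectl port-forward -n <namespace> <node-exporter-pod-name> 9100:9100
# Terminal 2: query the forwarded port.
curl http://localhost:9100/metrics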
```
The command should return Prometheus-format metric data, indicating that node-exporter is ready.
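To check the response without reading it line by line, a rough heuristic is to look for a "# TYPE" comment and at least one sample line. The sketch below is only a plausibility check, not a full parser of the Prometheus text format, and the helper name looks_like_metrics is hypothetical.

```shell
# Hypothetical helper: succeed if stdin looks like Prometheus text
# exposition output (a "# TYPE" line plus at least one sample line).
looks_like_metrics() {
  body=$(cat)
  printf '%s\n' "$body" | grep -q '^# TYPE [a-zA-Z_:][a-zA-Z0-9_:]* ' &&
    printf '%s\n' "$body" | grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})? [^ ]+'
}

# Example usage while the port-forward is active (not executed here):
#   curl -s http://localhost:9100/metrics | looks_like_metrics && echo "metrics OK"
```

An HTML error page or an empty response fails both checks, which usually points at a wrong port or a Pod that is not serving metrics.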

Installing dcgm-exporter

dcgm-exporter exposes monitoring metrics for NVIDIA GPUs and must be deployed on all nodes containing GPUs. You can use node selectors or taints and tolerations to limit the deployment scope.
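As an illustration of limiting the deployment scope, the sketch below writes a Helm values file that combines a node selector with a toleration. Everything specific in it is an assumption: the helper name write_gpu_values is hypothetical, the label nvidia.com/gpu.present and the taint key nvidia.com/gpu are common conventions rather than values confirmed for your cluster, and the chart you install must expose nodeSelector and tolerations values for the file to have any effect.

```shell
# Hypothetical helper: write a Helm values file to the given path that
# restricts scheduling to GPU nodes. Label and taint names are assumptions.
write_gpu_values() {
  cat > "$1" <<'EOF'
# Schedule only onto nodes carrying this label (assumed; adjust to your cluster).
nodeSelector:
  nvidia.com/gpu.present: "true"
# Tolerate a taint commonly placed on GPU nodes (assumed key and effect).
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
}

# Example usage (not executed here):
#   write_gpu_values gpu-values.yaml
#   then pass the file to helm install with -f gpu-values.yaml
```

If your GPU nodes carry a different label or taint, replace the keys in the generated file before using it.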