
Kubernetes Cluster Monitoring Shows No Data

When the data plane is deployed on a Kubernetes cluster, Neutree relies on the cluster's node-exporter and dcgm-exporter components to collect and expose Prometheus-formatted metrics for nodes and GPUs. Neutree does not deploy these components itself, so node and GPU monitoring data cannot be collected or displayed until they are installed.

For Kubernetes clusters, Neutree automatically collects node metrics from port 9100 on every node, and GPU metrics from port 9400 on Pods labeled app=nvidia-dcgm-exporter.
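
To spot-check what Neutree will scrape, you can list the node addresses (scraped on port 9100) and the Pods carrying the app=nvidia-dcgm-exporter label (scraped on port 9400). The commands below are a minimal sketch and assume kubectl access to the same cluster:

    Terminal window
    kubectl get nodes -o wide
    kubectl get pods --all-namespaces -l app=nvidia-dcgm-exporter -o wide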

  • If you have already deployed node-exporter and dcgm-exporter, ensure they meet the metric endpoint and deployment requirements described in Component Description below. Once the requirements are met, you can view node and GPU monitoring data on the cluster details monitoring page.

  • If you have not deployed node-exporter and dcgm-exporter, manually install node-exporter on every node in your Kubernetes cluster, and manually install dcgm-exporter on nodes with GPUs to collect node and GPU monitoring metrics. After deployment, you can view node and GPU monitoring data on the cluster details monitoring page.
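
To check whether either component is already running in your cluster, a simple (if approximate) filter over the existing DaemonSets is usually enough; adjust the pattern if your deployments use different names:

    Terminal window
    kubectl get daemonsets --all-namespaces | grep -Ei 'node-exporter|dcgm'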

Component Description

| Component | Purpose | Metric Types | Metric Endpoint (Port) | Deployment Method |
| --- | --- | --- | --- | --- |
| node-exporter | Collects hardware and OS metrics from nodes. | CPU, memory, disk, network, and other system-level monitoring data. | 9100 | DaemonSet |
| dcgm-exporter | Collects NVIDIA GPU metrics. | GPU utilization, memory, temperature, power consumption, and other GPU-related monitoring data. | 9400 | DaemonSet |

Deploy node-exporter

node-exporter exposes system metrics from nodes and must be deployed as a DaemonSet on every node in the cluster.

Prerequisites

Ensure the Neutree data plane can access port 9100 of node-exporter.
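
Since the data plane runs in the same cluster, one way to verify this prerequisite (once node-exporter from the steps below is installed) is to curl a node's port 9100 from inside the cluster. The sketch below assumes the curlimages/curl image can be pulled and that <node-ip> is a node's internal IP (shown by kubectl get nodes -o wide); it should print 200:

    Terminal window
    kubectl run port-check -i --rm --restart=Never --image=curlimages/curl -- \
      curl -s -o /dev/null -w "%{http_code}\n" http://<node-ip>:9100/metrics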

Steps

  1. Replace <namespace> in the following command and run it to install node-exporter:

    Terminal window
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install node-exporter prometheus-community/prometheus-node-exporter --namespace=<namespace>
  2. Check the node-exporter Pod status:

    Terminal window
    kubectl get pods -n <namespace> -l app.kubernetes.io/name=prometheus-node-exporter

    All Pods should be in Running status, indicating node-exporter has been successfully deployed to all nodes (a quick way to cross-check coverage against the node count is shown after this list).

  3. Verify the metrics endpoint:

    Terminal window
    kubectl port-forward -n <namespace> <node-exporter-pod-name> 9100:9100
    curl http://localhost:9100/metrics

    Prometheus-formatted metrics data should be returned, indicating node-exporter is ready.
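
As mentioned in step 2, a quick way to confirm that node-exporter covers every node is to compare the number of nodes with the DESIRED and READY counts of its DaemonSet (the DaemonSet name depends on your Helm release name, so the second command simply lists all DaemonSets in the namespace):

    Terminal window
    kubectl get nodes --no-headers | wc -l
    kubectl get daemonset -n <namespace>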

Deploy dcgm-exporter

dcgm-exporter exposes NVIDIA GPU monitoring metrics and needs to be deployed on all nodes with GPUs. You can use node selectors or tolerations to limit the deployment scope.
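
For example, if your GPU nodes carry a label such as nvidia.com/gpu.present=true (commonly set by the GPU Operator's GPU Feature Discovery; substitute whatever label your nodes actually have), a merge patch like the following, applied after the DaemonSet from the steps below exists, restricts dcgm-exporter to those nodes. This is only a sketch, not part of the required installation:

    Terminal window
    kubectl -n <namespace> patch daemonset dcgm-exporter --type merge \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"nvidia.com/gpu.present":"true"}}}}}'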

Prerequisites

  • The Neutree data plane can access the dcgm-exporter metrics endpoint (default port 9400).

  • The Kubernetes cluster has the NVIDIA GPU Operator installed, or the NVIDIA GPU driver and NVIDIA Container Toolkit have been installed manually.

  • GPU nodes in the Kubernetes cluster are correctly recognized by Kubernetes. You can verify GPU resources using kubectl describe node <node-name>.
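
As a quick check for the last prerequisite, the allocatable GPU count should appear in the node's resources; any non-empty nvidia.com/gpu line under Capacity or Allocatable indicates the device plugin has registered the GPUs:

    Terminal window
    kubectl describe node <node-name> | grep -i "nvidia.com/gpu"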

Steps

  1. Replace <namespace> in the following command and run it to install dcgm-exporter:

    Terminal window
    kubectl apply -f - << EOF
    apiVersion: v1
    data:
      dcgm-metrics.csv: |
        # Format
        # If line starts with a '#' it is considered a comment
        # DCGM FIELD, Prometheus metric type, help message
        # Clocks
        DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
        DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
        # Temperature
        DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
        DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
        # Power
        DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
        DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
        # PCIE
        # DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
        # DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
        DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
        # Utilization (the sample period varies depending on the product)
        DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
        DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
        DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
        DCGM_FI_DEV_DEC_UTIL, gauge, Decoder utilization (in %).
        # Errors and violations
        DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
        # DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
        # DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
        # DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
        # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
        # DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
        # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
        # Memory usage
        DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
        DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
        DCGM_FI_DEV_FB_TOTAL, gauge, Framebuffer memory total (in MiB).
        DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
        # ECC
        # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
        # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
        # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
        # Retired pages
        # DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
        # DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
        # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
        # NVLink
        # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
        # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
        # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
        # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
        DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
        # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
        # VGPU License status
        DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
        # Remapped rows
        DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
        DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
        DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
        # Static configuration information. These appear as labels on the other metrics
        DCGM_FI_DRIVER_VERSION, label, Driver Version
        # DCGM_FI_NVML_VERSION, label, NVML Version
        # DCGM_FI_DEV_BRAND, label, Device Brand
        # DCGM_FI_DEV_SERIAL, label, Device Serial Number
        # DCGM_FI_DEV_OEM_INFOROM_VER, label, OEM inforom version
        # DCGM_FI_DEV_ECC_INFOROM_VER, label, ECC inforom version
        # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
        # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
        # DCGM_FI_DEV_VBIOS_VERSION, label, VBIOS version of the device
        # DCP metrics
        DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
        # DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
        # DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
        DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
        DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
        # DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
        # DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
        DCGM_FI_PROF_PCIE_TX_BYTES, gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
        DCGM_FI_PROF_PCIE_RX_BYTES, gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    kind: ConfigMap
    metadata:
      name: metrics-config
      namespace: <namespace>
    ---
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      labels:
        app.kubernetes.io/name: dcgm-exporter
        app.kubernetes.io/version: 4.7.1
      name: dcgm-exporter
      namespace: <namespace>
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app.kubernetes.io/name: dcgm-exporter
          app.kubernetes.io/version: 4.7.1
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: nvidia-dcgm-exporter
            app.kubernetes.io/name: dcgm-exporter
            app.kubernetes.io/version: 4.7.1
          name: dcgm-exporter
        spec:
          automountServiceAccountToken: false
          containers:
            - env:
                - name: DCGM_EXPORTER_LISTEN
                  value: ":9400"
                - name: DCGM_EXPORTER_KUBERNETES
                  value: "true"
                - name: DCGM_EXPORTER_COLLECTORS
                  value: /etc/dcgm-exporter/dcgm-metrics.csv
              image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04
              imagePullPolicy: IfNotPresent
              name: dcgm-exporter
              ports:
                - containerPort: 9400
                  name: metrics
                  protocol: TCP
              resources:
                limits:
                  cpu: 200m
                  memory: 512Mi
                requests:
                  cpu: 100m
                  memory: 128Mi
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  add:
                    - SYS_ADMIN
                  drop:
                    - ALL
                runAsNonRoot: false
                runAsUser: 0
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              volumeMounts:
                - mountPath: /var/lib/kubelet/pod-resources
                  name: pod-gpu-resources
                  readOnly: true
                - mountPath: /etc/dcgm-exporter/dcgm-metrics.csv
                  name: metrics-config
                  readOnly: true
                  subPath: dcgm-metrics.csv
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          securityContext: {}
          terminationGracePeriodSeconds: 30
          volumes:
            - hostPath:
                path: /var/lib/kubelet/pod-resources
                type: ""
              name: pod-gpu-resources
            - configMap:
                defaultMode: 420
                items:
                  - key: dcgm-metrics.csv
                    path: dcgm-metrics.csv
                name: metrics-config
              name: metrics-config
      updateStrategy:
        rollingUpdate:
          maxSurge: 0
          maxUnavailable: 1
        type: RollingUpdate
    EOF
  2. Check the dcgm-exporter Pod status:

    Terminal window
    kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter

    All Pods should be in Running status, indicating dcgm-exporter has been successfully deployed to all nodes with GPUs.

  3. Verify the metrics endpoint:

    Terminal window
    kubectl port-forward -n <namespace> <dcgm-exporter-pod-name> 9400:9400
    curl http://localhost:9400/metrics

    GPU-related metrics data (such as DCGM_FI_DEV_GPU_UTIL) should be returned, indicating dcgm-exporter is ready.
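
Finally, you can confirm that the Pods expose exactly what Neutree discovers, i.e. the app=nvidia-dcgm-exporter label and container port 9400; the custom-columns expression below is just one way to display this:

    Terminal window
    kubectl get pods -n <namespace> -l app=nvidia-dcgm-exporter \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PORT:.spec.containers[0].ports[0].containerPort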
