Managing static node clusters
Servers or virtual machines can be added as nodes to form a static node cluster. There are two node types: head node and worker node.
- Head node: Both a control node and a worker node. It runs management services and can also run AI workloads.
- Worker node: Runs AI workloads only; does not run management services.
The minimum cluster size is a single node (head node only, with no worker nodes required). In this configuration, the head node runs both management services and AI workloads simultaneously.
In a multi-node cluster, it is recommended to use a node without accelerators as the head node for management services, and add nodes with accelerators as worker nodes dedicated to AI workloads.
Plan the number of worker nodes in advance based on your business requirements.
Node requirements
Ensure node configurations meet the following requirements:
- Resource configuration
  - System disk: 200 GiB
  - CPU: at least 8-core vCPU
  - Memory: at least 16 GiB
- OS image

  | Accelerator type | OS image |
  |---|---|
  | CPU or NVIDIA GPU | Rocky-8.10-x86_64-minimal.iso |
  | AMD GPU | Ubuntu-22.04.5-live-server-amd64.iso |
- Port requirements

  If there is a firewall between your static node cluster and Neutree, open the following ports on the target side. Unless otherwise specified, all ports listed below are TCP ports.

  | Source | Destination | Port | Purpose |
  |---|---|---|---|
  | Control plane | All nodes | 22 | Used for remote login, static node initialization, and maintenance. If the node uses a non-standard SSH port, see Using a non-standard SSH port to configure the node. |
  | Control plane | All nodes | 54311 | Scrape node runtime status data. |
  | Control plane | All nodes | 44217 | Retrieve monitoring data for auto-scaling. |
  | Control plane | All nodes | 44227 | Export monitoring data for dashboards. |
  | All nodes | All nodes | 10002-20000 | Core channel for inter-node data exchange and distributed computing. |
  | All nodes | All nodes | 8077 | Used for node management. |
  | All nodes | All nodes | 8076 | Used for shared memory object access and distribution. |
  | All nodes | All nodes | 56999 | Used for managing execution environments (dependencies, etc.) on each node. |
  | Head node | All nodes | 52365, 8078 | Proxy for dashboard command delivery. |
  | Control plane | Head node | 8265, 8079 | Access the graphical management interface. |
  | Control plane | Head node | 8000 | Entry point for the vLLM model inference service. |
  | Worker node | Head node | 6379 | Ray cluster metadata center. |
  | Developer | Head node | 10001 | Allows connecting to the cluster from remote scripts to run jobs. |
  | All nodes | Node where monitoring components are deployed | 8480 | Required when monitoring components are deployed on a server or VM to upload monitoring metrics. |
  | All nodes | LoadBalancer IP allocated to monitoring components deployed on Kubernetes | 8480 | Required when monitoring components are deployed on a Kubernetes cluster to upload monitoring metrics. |
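The resource minimums and port list above can be checked before a node is added. The following is a minimal sketch rather than a Neutree tool: the function names are hypothetical, memory is rounded to the nearest GiB because `MemTotal` usually reports slightly less than the nominal RAM size, and the port probe assumes `bash` and `timeout` are installed.

```shell
# Hypothetical pre-flight helpers; none of these names come from Neutree.

# meets_min ACTUAL REQUIRED: succeed when ACTUAL >= REQUIRED (integers).
meets_min() {
  [ "$1" -ge "$2" ]
}

# kib_to_gib KIB: convert KiB to GiB, rounded to the nearest integer.
kib_to_gib() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# port_open HOST PORT: succeed when a TCP connection opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be available.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_node_resources: print one line per unmet minimum on this machine;
# print nothing when the CPU and memory minimums are both met.
check_node_resources() {
  cpus=$(nproc)
  mem_gib=$(kib_to_gib "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)")
  meets_min "$cpus" 8 || echo "CPU: $cpus vCPUs, need at least 8"
  meets_min "$mem_gib" 16 || echo "Memory: $mem_gib GiB, need at least 16"
}
```

For example, `check_node_resources` can be run on each node, and `port_open 10.0.0.11 22 && echo "SSH reachable"` can be run from the control plane (the IP address is a placeholder). The sketch does not check the system disk size.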
Configuring the operating system
Configure the system and install Docker as the container runtime according to the OS type of your nodes.
System configuration
Rocky Linux:
- Configure static IP addresses:

      sudo vi /etc/sysconfig/network-scripts/ifcfg-<interface>

  Replace `<interface>` with the network interface name, for example `eth0`.
- Configure the DNS server:

      sudo vi /etc/resolv.conf
- Disable the firewall:

      sudo systemctl stop firewalld && sudo systemctl disable firewalld
- Disable SELinux:

      echo -e "SELINUX=disabled\nSELINUXTYPE=targeted" | sudo tee /etc/selinux/config
      sudo setenforce 0
- Install dependencies:

      sudo dnf install rsync pciutils -y
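For the static IP step on Rocky Linux, it may help to see what the edited `ifcfg-<interface>` file typically contains. The sketch below prints such a file from its arguments. The helper name `render_ifcfg` and every address in the usage example are hypothetical, so substitute the values for your own network.

```shell
# render_ifcfg IFACE IPADDR PREFIX GATEWAY DNS
# Print a static-IP ifcfg file for IFACE to stdout (nothing is written to disk).
render_ifcfg() {
  cat <<EOF
TYPE=Ethernet
BOOTPROTO=none
NAME=$1
DEVICE=$1
ONBOOT=yes
IPADDR=$2
PREFIX=$3
GATEWAY=$4
DNS1=$5
EOF
}
```

The output can be reviewed first and then written with, for example, `render_ifcfg eth0 192.168.1.10 24 192.168.1.1 192.168.1.1 | sudo tee /etc/sysconfig/network-scripts/ifcfg-eth0`.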
Ubuntu:
- Configure static IP addresses and DNS server:

      sudo vi /etc/netplan/50-cloud-init.yaml
- Apply the network configuration:

      sudo netplan apply
- Disable the firewall:

      sudo ufw disable
- Optional: disable AppArmor if needed:

      sudo systemctl disable apparmor && sudo systemctl stop apparmor
- Install dependencies:

      sudo apt-get update && sudo apt-get install rsync pciutils -y
- Restart the OS for the configuration to take effect:

      sudo reboot
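For the netplan step on Ubuntu, the sketch below prints a minimal static-IP configuration so the expected YAML layout is visible. The helper name `render_netplan` and the addresses in the usage example are hypothetical, and the layout assumes a single Ethernet interface with one default route.

```shell
# render_netplan IFACE ADDRESS/PREFIX GATEWAY DNS
# Print a netplan document with a static address, default route and DNS server.
render_netplan() {
  cat <<EOF
network:
  version: 2
  ethernets:
    $1:
      dhcp4: false
      addresses: [$2]
      routes:
        - to: default
          via: $3
      nameservers:
        addresses: [$4]
EOF
}
```

For example, `render_netplan eth0 192.168.1.10/24 192.168.1.1 192.168.1.1` prints the document for review before it is saved to `/etc/netplan/50-cloud-init.yaml` and applied with `sudo netplan apply`.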
Installing Docker
Rocky Linux:
- Install Docker CE:

      sudo dnf -y install dnf-plugins-core
      sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
      sudo dnf -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
- Start the Docker service:

      sudo systemctl enable --now docker
- Confirm Docker is installed successfully:

      docker --version
- Restart the OS for the configuration to take effect:

      sudo reboot
Ubuntu:
- Update the package index:

      sudo apt-get update
      sudo apt-get -y install ca-certificates curl
- Add the Docker GPG key:

      sudo install -m 0755 -d /etc/apt/keyrings
      sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
      sudo chmod a+r /etc/apt/keyrings/docker.asc
- Add the Docker repository:

      echo \
        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
        $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
        sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
- Install Docker CE:

      sudo apt-get update
      sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
- Start the Docker service:

      sudo systemctl enable --now docker
- Confirm Docker is installed successfully:

      docker --version
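`docker --version` only confirms that the client binary is installed. As an optional extra check, the sketch below asks the daemon for its version, which also confirms that the service is running. The function names and the `DOCKER_BIN` override are hypothetical and exist only to make the helper easy to dry-run.

```shell
# docker_server_version: print the daemon version reported by `docker info`.
# DOCKER_BIN is a hypothetical override; it defaults to the real docker client.
docker_server_version() {
  "${DOCKER_BIN:-docker}" info --format '{{.ServerVersion}}'
}

# docker_ready: succeed only when the daemon answers with a non-empty version.
docker_ready() {
  v=$(docker_server_version 2>/dev/null) && [ -n "$v" ]
}
```

A typical call is `docker_ready && echo "Docker daemon is reachable"`. If the current user is not allowed to talk to the Docker socket, run the check as root.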
Configuring accelerators
If the nodes include accelerators, complete the corresponding configuration based on accelerator type:
Follow NVIDIA’s official documentation to complete the following:
- Disable the NVIDIA GPU Nouveau driver. See the Disabling the Nouveau Driver for NVIDIA Graphics Cards section in the Virtual GPU Software User Guide.
- Install an NVIDIA graphics driver whose version is no lower than 530.x.x and no higher than 590.x.x. See the Installing the NVIDIA vGPU Software Graphics Driver section in the Virtual GPU Software User Guide.
- Install the NVIDIA Container Toolkit. See the Installing the NVIDIA Container Toolkit section in the NVIDIA Container Toolkit documentation. If the node cannot access the internet, see Installing NVIDIA Container Toolkit offline.
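After the driver is installed, its version can be compared with the supported range. The sketch below is one way to do that. It assumes the range applies to the major version (530 through 590, inclusive), which is an interpretation of the requirement above, and the function names are hypothetical.

```shell
# driver_major VERSION: print the major component, e.g. 550 for 550.54.14.
driver_major() {
  echo "${1%%.*}"
}

# driver_supported VERSION: succeed when 530 <= major version <= 590.
driver_supported() {
  major=$(driver_major "$1")
  [ "$major" -ge 530 ] && [ "$major" -le 590 ]
}
```

On a node with the driver loaded, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1` and passed to `driver_supported`.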
The cluster image currently supports ROCm software version 6.3.3. Install the corresponding version of the AMDGPU driver and the AMD Container Toolkit. See the Quick Start Guide section of AMD's official AMD Container Toolkit documentation.
Preparing SSH private key
Before creating a static node cluster, you need to prepare an SSH private key for node authentication. The control plane uses the SSH private key to securely connect to and manage nodes in the cluster via the SSH protocol.
Creating an SSH key pair
An SSH key pair consists of a public key and a private key, used for node authentication and secure communication. For security, it is recommended to use different SSH key pairs for different clusters.
If you do not have an SSH key pair, create one with the following steps:
- Run the following command on the control plane or a local machine to generate an SSH key pair:

      ssh-keygen -t rsa -b 4096 -C "your_email@example.com" -f ~/.ssh/neutree_cluster_key

  Parameter descriptions:

  | Parameter | Description |
  |---|---|
  | `-t rsa` | Specifies RSA as the key algorithm. |
  | `-b 4096` | Specifies the key length as 4096 bits. |
  | `-C "your_email@example.com"` | Adds a comment to the key, typically an email address. |
  | `-f ~/.ssh/neutree_cluster_key` | Specifies the save path and file name for the key. |
- When prompted with `Enter passphrase (empty for no passphrase)`, press Enter to leave it empty (no passphrase).
- After the command completes, the key pair is generated at the specified location:
  - The private key file is at `~/.ssh/neutree_cluster_key`. The private key is sensitive information; keep it safe and do not share it.
  - The public key file is at `~/.ssh/neutree_cluster_key.pub`. This must be configured on all nodes in the cluster.
Configuring the public key on target nodes
The ~/.ssh/authorized_keys file on each node stores the public keys allowed to access it. After configuring the public key on all nodes following this section, SSH will automatically use key-based login without prompting for a password.
- Copy the public key content to the `~/.ssh/authorized_keys` file on the target node using one of the following methods:

  With `ssh-copy-id`:

      ssh-copy-id -i ~/.ssh/neutree_cluster_key.pub <username>@<node_ip>

  With `cat` over SSH:

      cat ~/.ssh/neutree_cluster_key.pub | ssh <username>@<node_ip> "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

  | Parameter | Description |
  |---|---|
  | `<username>` | SSH username. Must be root or a user with root privileges. |
  | `<node_ip>` | The IP address of the cluster node. |
- Set the permissions for the `~/.ssh/authorized_keys` file and its parent directory on the target node:

      ssh <username>@<node_ip> "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"
Verifying SSH connectivity
Run the following commands from the machine that holds the private key, once for each node in the cluster, and ensure that every node connects successfully.
- Test SSH connectivity using the private key:

      ssh -i ~/.ssh/neutree_cluster_key <username>@<node_ip>

  You should be able to log in without a password, indicating that the SSH key is configured correctly.
- For non-root users, test root privileges:

      ssh -i ~/.ssh/neutree_cluster_key <username>@<node_ip> "sudo whoami"

  The expected output is `root`, indicating the user has sudo privileges.
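With more than a few nodes, the two checks above can be looped. The sketch below is a hedged example: the function names and the `SSH_BIN` override are hypothetical, `BatchMode=yes` makes `ssh` fail instead of asking for a password, and `sudo -n` makes `sudo` fail instead of prompting, so the loop assumes passwordless sudo for non-root users.

```shell
# check_node USER IP KEY: print "IP ok" when key-based login works and
# `sudo -n whoami` returns root on the node; otherwise print "IP FAILED".
# SSH_BIN is a hypothetical override; it defaults to the real ssh client.
check_node() {
  who=$("${SSH_BIN:-ssh}" -i "$3" -o BatchMode=yes -o ConnectTimeout=5 \
        "$1@$2" "sudo -n whoami" 2>/dev/null) || who=""
  if [ "$who" = "root" ]; then
    echo "$2 ok"
  else
    echo "$2 FAILED"
  fi
}

# check_nodes USER KEY IP...: run check_node for every IP given.
check_nodes() {
  user=$1
  key=$2
  shift 2
  for ip in "$@"; do
    check_node "$user" "$ip" "$key"
  done
}
```

For example, `check_nodes root ~/.ssh/neutree_cluster_key 10.0.0.11 10.0.0.12` prints one result line per node (the IP addresses are placeholders).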
Retrieving the private key content
Before creating the cluster, retrieve the private key content with the following command:

    cat ~/.ssh/neutree_cluster_key

The private key content looks similar to:

    -----BEGIN OPENSSH PRIVATE KEY-----
    b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAACFwAAAAdzc2gtcn...
    -----END OPENSSH PRIVATE KEY-----

Warning
- The private key is sensitive information. Keep it safe and do not share it with others.
- Ensure the private key file has permissions set to 600 (`chmod 600 ~/.ssh/neutree_cluster_key`); otherwise the SSH client may refuse to use the key.
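The permission requirement in the warning can be verified with a short check before the key is used. This is a sketch with hypothetical function names. It tries GNU `stat` first and falls back to the BSD form, in case the key was generated on a macOS machine.

```shell
# key_mode FILE: print the octal permission bits of FILE, e.g. 600.
key_mode() {
  stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# key_perms_ok FILE: succeed when the mode is exactly 600.
key_perms_ok() {
  [ "$(key_mode "$1")" = "600" ]
}
```

For example: `key_perms_ok ~/.ssh/neutree_cluster_key || chmod 600 ~/.ssh/neutree_cluster_key`.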
Creating a cluster
Follow the steps below to create a cluster. If the cluster nodes cannot access Docker Hub or the connection is slow, you can manually import cluster images.
- Log in to the Neutree management interface. Click Clusters in the left navigation pane, then click Create on the right.
- Fill in the configuration.
  - Basic Information

    | Parameter | Description | Editable after creation |
    |---|---|---|
    | Name | The name of the cluster. | No |
    | Workspace | The workspace to which the cluster belongs. | No |
  - Image Registry

    Select a container registry for the cluster to store cluster-related container images. If no registry is listed, see the Creating a container registry section. If no registry is available in your environment, see the Set up a temporary container registry section. This field is not editable after creation.
  - Cluster type

    The cluster type. Select Static Nodes. This field is not editable after creation.
  - Version

    The cluster version. The system automatically retrieves available cluster versions from the selected container registry. After creation, the version can be updated via Upgrading the cluster version.
  - Provider

    | Parameter | Description | Editable after creation |
    |---|---|---|
    | Head Node IP | The IP address of the head node. | No |
    | Worker Node IPs | The IP addresses of the worker nodes. Not required for single-node clusters. For multi-node clusters, enter an IP address and click + Add to add the next one. | Yes |
  - Node Authentication

    | Parameter | Description | Editable after creation |
    |---|---|---|
    | SSH User | SSH username. Must be root or a user with root privileges. | No |
    | SSH Private Key | The SSH private key string. See the Preparing SSH private key section for how to obtain it. | No |
  - Model Caches

    | Parameter | Description | Editable after creation |
    |---|---|---|
    | Name | The name of the model cache. | No |
    | Cache Type | Static node clusters only support Host Path. | No |
    | Cache Path | The host path for the model cache. | Yes |

    If a model cache is not configured during creation, it cannot be added after the cluster is created.
- After confirming the configuration is correct, click Save to complete creation.
Manually importing a cluster image
When upgrading the cluster version or when the network environment is restricted, you can manually import the required cluster images into the Neutree container registry.
Procedure
- Download version 1.0.1 of the Neutree CLI tool and the cluster offline image for the required accelerator type, according to the server's CPU architecture.
- Use the CLI tool to upload the cluster offline image to the specified registry:

      ./neutree-cli-<arch> import cluster \
        --package <cluster_package> \
        --mirror-registry <mirror_registry> \
        [--registry-project <registry_project>] \
        --registry-username <registry_username> \
        --registry-password <registry_password>

  | Parameter | Description |
  |---|---|
  | `<arch>` | The server's CPU architecture: `amd64` or `aarch64`. |
  | `<cluster_package>` | The cluster offline image name, in the format `neutree-cluster-ssh-v1.0.1-<arch>.tar.gz`. |
  | `<mirror_registry>` | The registry address must match the registry address used when uploading images with the CLI tool during Neutree management plane deployment. Enter an OCI-compatible image registry address without the `https://` prefix. |
  | `--registry-project <registry_project>` | Optional. The registry project name. Ensure the corresponding project has been created in the registry in advance. |
  | `<registry_username>` | The username for the registry. The user must have image upload permissions. |
  | `<registry_password>` | The login password or access key (such as a token) for the registry user. |
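Both the CLI binary name and the package name depend on `<arch>`, so it can be convenient to derive them from the machine. The sketch below maps `uname -m` output to the two values listed in the table and builds the package name in the documented format. The function names are hypothetical.

```shell
# neutree_arch [MACHINE]: map a `uname -m` value (default: this machine)
# to the arch names used by the CLI; fail for anything else.
neutree_arch() {
  case "${1:-$(uname -m)}" in
    x86_64|amd64) echo amd64 ;;
    aarch64|arm64) echo aarch64 ;;
    *) return 1 ;;
  esac
}

# cluster_package ARCH: offline image name for cluster version 1.0.1.
cluster_package() {
  echo "neutree-cluster-ssh-v1.0.1-$1.tar.gz"
}
```

With `arch=$(neutree_arch)`, the import command above becomes `./neutree-cli-$arch import cluster --package "$(cluster_package "$arch")"` followed by the registry options for your environment.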
Viewing clusters
Log in to the Neutree management interface. Click Clusters in the left navigation pane. The cluster list on the right shows all current clusters. Click a cluster name to view its details.
On the details page, you can view Basic Information, Monitor, and the Ray Dashboard as needed.
Possible cluster states during operation and their descriptions:
| State | Description |
|---|---|
| Initializing | The cluster is being initialized for the first time. |
| Running | The cluster is operating normally. |
| Updating | The cluster configuration has changed and the new configuration is being applied. |
| Upgrading | The cluster is undergoing a version upgrade. |
| Failed | The cluster is experiencing an error. Check node status and logs. |
| Deleting | The cluster is being deleted and resources are being cleaned up. |
Editing a cluster
After a cluster is created, you can modify the worker node configuration and model cache path as needed.
- Log in to the Neutree management interface. In the cluster list or details page, click the menu icon (…) and select Edit.
- Modify the configuration as needed. For parameter descriptions, see Creating a cluster.
- After confirming the configuration is correct, click Save to complete the edit.
Deleting clusters
You can delete one or more clusters at the same time.
- Log in to the Neutree management interface. In the cluster list or details page, click the menu icon (…) and select Delete; or select multiple clusters in the list and click Delete above the list.
- In the confirmation dialog, confirm the deletion and click Delete. The selected clusters will be permanently deleted.