Neutree Documentation

Managing static node clusters

Servers or virtual machines can be added as nodes to form a static node cluster. There are two node types: head node and worker node.

  • Head node: Both a control node and a worker node. It runs management services and can also run AI workloads.
  • Worker node: Runs AI workloads only; does not run management services.

The minimum cluster size is a single node (head node only, with no worker nodes required). In this configuration, the head node runs both management services and AI workloads simultaneously.

In a multi-node cluster, it is recommended to use a node without accelerators as the head node for management services, and add nodes with accelerators as worker nodes dedicated to AI workloads.

Plan the number of worker nodes in advance based on your business requirements.

Node requirements

Ensure node configurations meet the following requirements:

  • Resource configuration

    • System disk: 200 GiB
    • CPU: at least 8 vCPUs
    • Memory: at least 16 GiB
  • OS image

    Accelerator type | OS image
    CPU or NVIDIA GPU | Rocky-8.10-x86_64-minimal.iso
    AMD GPU | Ubuntu-22.04.5-live-server-amd64.iso
  • Port requirements

    If there is a firewall between your static node cluster and Neutree, open the following ports on the destination side. Unless otherwise specified, all ports listed below are TCP ports.

    Source | Destination | Port | Purpose
    Control plane | All nodes | 22 | Used for remote login, static node initialization, and maintenance. If the node uses a non-standard SSH port, see Using a non-standard SSH port to configure the node.
    Control plane | All nodes | 54311 | Scrape node runtime status data.
    Control plane | All nodes | 44217 | Retrieve monitoring data for auto-scaling.
    Control plane | All nodes | 44227 | Export monitoring data for dashboards.
    All nodes | All nodes | 10002-20000 | Core channel for inter-node data exchange and distributed computing.
    All nodes | All nodes | 8077 | Used for node management.
    All nodes | All nodes | 8076 | Used for shared memory object access and distribution.
    All nodes | All nodes | 56999 | Used for managing execution environments (dependencies, etc.) on each node.
    Head node | All nodes | 52365, 8078 | Proxy for dashboard command delivery.
    Control plane | Head node | 8265, 8079 | Access the graphical management interface.
    Control plane | Head node | 8000 | Entry point for the vLLM model inference service.
    Worker node | Head node | 6379 | Ray cluster metadata center.
    Developer | Head node | 10001 | Allows connecting to the cluster from remote scripts to run jobs.
    All nodes | Node where monitoring components are deployed | 8480 | Required when monitoring components are deployed on a server or VM to upload monitoring metrics.
    All nodes | LoadBalancer IP allocated to monitoring components deployed on Kubernetes | 8480 | Required when monitoring components are deployed on a Kubernetes cluster to upload monitoring metrics.
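The requirements above can be checked with a small preflight script before a node is added. The following is a minimal sketch, not part of Neutree: the function names are illustrative, the thresholds are the minimums listed above, and the root filesystem size is used only as a rough proxy for the system disk size. The port probe relies on bash's /dev/tcp redirection and only succeeds when a service is already listening on the port, so before cluster creation it is mainly useful for port 22.

```shell
# Preflight sketch. Thresholds are the documented minimums; function names are illustrative.
MIN_VCPUS=8
MIN_MEM_GIB=16
MIN_DISK_GIB=200

# check_min LABEL ACTUAL MINIMUM: report whether ACTUAL meets MINIMUM.
check_min() {
  if [ "$2" -ge "$3" ]; then
    echo "ok: $1 = $2 (minimum $3)"
  else
    echo "FAIL: $1 = $2 (minimum $3)"
    return 1
  fi
}

# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Only succeeds when a service is listening on PORT.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# preflight: check the local node's CPU, memory, and root filesystem size.
preflight() {
  local rc=0 mem_kib disk_gib
  mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  # Root filesystem size is only a rough proxy for the system disk size.
  disk_gib=$(df -BG --output=size / | tail -n 1 | tr -dc '0-9')
  check_min "vCPUs" "$(nproc)" "$MIN_VCPUS" || rc=1
  # Round up to whole GiB because MemTotal excludes memory reserved by the kernel.
  check_min "memory (GiB)" "$(( (mem_kib + 1048575) / 1048576 ))" "$MIN_MEM_GIB" || rc=1
  check_min "root filesystem (GiB)" "$disk_gib" "$MIN_DISK_GIB" || rc=1
  return "$rc"
}
```

Run preflight on each node, and run port_open <node_ip> 22 from the control plane to confirm that SSH is reachable through the firewall.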

Configuring the operating system

Configure the system and install Docker as the container runtime according to the OS type of your nodes. The commands shown in this section (dnf, firewalld, SELinux) apply to Rocky Linux nodes.

System configuration

  1. Configure static IP addresses:

    Terminal window
    sudo vi /etc/sysconfig/network-scripts/ifcfg-<interface>

    Replace <interface> with the network interface name, for example eth0.

  2. Configure the DNS server:

    Terminal window
    sudo vi /etc/resolv.conf
  3. Disable the firewall:

    Terminal window
    sudo systemctl stop firewalld && sudo systemctl disable firewalld
  4. Disable SELinux:

    Terminal window
    echo -e "SELINUX=disabled\nSELINUXTYPE=targeted" | sudo tee /etc/selinux/config
    sudo setenforce 0
  5. Install dependencies:

    sudo dnf install rsync pciutils -y
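After completing the steps above, the result can be verified with a few read-only commands. This is a sketch for a Rocky Linux node; the function names are illustrative and not part of any Neutree tooling.

```shell
# selinux_config_mode [FILE]: print the SELINUX= value from the SELinux config file.
selinux_config_mode() {
  awk -F= '/^SELINUX=/ {print $2}' "${1:-/etc/selinux/config}"
}

# verify_system_config: print the current state of each setting changed above.
verify_system_config() {
  # Expect "inactive" once firewalld has been stopped.
  echo "firewalld: $(systemctl is-active firewalld 2>/dev/null || true)"
  # Expect "Permissive" after setenforce 0, and "Disabled" after the next reboot.
  echo "SELinux (runtime): $(getenforce 2>/dev/null || echo unknown)"
  # Expect "disabled".
  echo "SELinux (config): $(selinux_config_mode)"
  # Prints one line per package; reports any package that is not installed.
  rpm -q rsync pciutils
}
```

Call verify_system_config on the node and compare the output with the expected values noted in the comments.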

Installing Docker

  1. Install Docker CE:

    Terminal window
    sudo dnf -y install dnf-plugins-core
    sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
    sudo dnf -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  2. Start the Docker service:

    Terminal window
    sudo systemctl enable --now docker
  3. Confirm Docker is installed successfully:

    Terminal window
    docker --version
  4. Restart the OS for the configuration to take effect:

    Terminal window
    sudo reboot

Configuring accelerators

If the nodes include accelerators, complete the corresponding configuration based on accelerator type:

For NVIDIA GPUs, follow NVIDIA’s official documentation to complete the following:

  1. Disable the NVIDIA GPU Nouveau driver. See the Disabling the Nouveau Driver for NVIDIA Graphics Cards section in the Virtual GPU Software User Guide.

  2. Install an NVIDIA graphics driver whose version is no lower than 530.x.x and no higher than 590.x.x. See the Installing the NVIDIA vGPU Software Graphics Driver section in the Virtual GPU Software User Guide.

  3. Install the NVIDIA Container Toolkit. See the Installing the NVIDIA Container Toolkit section in the NVIDIA Container Toolkit documentation. If the node cannot access the internet, see Installing NVIDIA Container Toolkit offline.
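Once the three steps are done, a quick check can confirm that Nouveau is no longer loaded and that the installed driver falls inside the supported range. The sketch below uses illustrative function names and interprets the range as "major version from 530 to 590", which is an assumption about how the version bound is meant to be read.

```shell
# driver_in_range VERSION: succeed when the major version is between 530 and 590.
driver_in_range() {
  local major="${1%%.*}"
  [ "$major" -ge 530 ] && [ "$major" -le 590 ]
}

# verify_nvidia: report Nouveau status and the driver version seen by each GPU.
verify_nvidia() {
  if lsmod | grep -q '^nouveau'; then
    echo "nouveau is still loaded"
  fi
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | while read -r version; do
    if driver_in_range "$version"; then
      echo "driver $version: ok"
    else
      echo "driver $version: outside the 530-590 range"
    fi
  done
}
```

To confirm that containers can see the GPUs, NVIDIA's documentation suggests running a sample workload such as sudo docker run --rm --gpus all ubuntu nvidia-smi, which requires the node to be able to pull the ubuntu image.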

Preparing SSH private key

Before creating a static node cluster, you need to prepare an SSH private key for node authentication. The control plane uses the SSH private key to securely connect to and manage nodes in the cluster via the SSH protocol.

Creating an SSH key pair

An SSH key pair consists of a public key and a private key, used for node authentication and secure communication. For security, it is recommended to use different SSH key pairs for different clusters.

If you do not have an SSH key pair, create one with the following steps:

  1. Run the following command on the control plane or a local machine to generate an SSH key pair:

    Terminal window
    ssh-keygen -t rsa -b 4096 -C "your_email@example.com" -f ~/.ssh/neutree_cluster_key

    Parameter descriptions:

    Parameter | Description
    -t rsa | Specifies RSA as the key type.
    -b 4096 | Specifies the key length as 4096 bits.
    -C "your_email@example.com" | Adds a comment to the key, typically an email address.
    -f ~/.ssh/neutree_cluster_key | Specifies the save path and file name for the key.
  2. When prompted with Enter passphrase (empty for no passphrase), press Enter to leave it empty (no passphrase).

  3. After the command completes, the key pair is generated at the specified location:

    • The private key file is at ~/.ssh/neutree_cluster_key. The private key is sensitive information; keep it safe and do not share it.
    • The public key file is at ~/.ssh/neutree_cluster_key.pub. This must be configured on all nodes in the cluster.
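For scripted setups, the same key pair can be generated without the interactive passphrase prompt by passing an empty passphrase with -N "". The wrapper below is a sketch with an illustrative function name; it also refuses to overwrite an existing key, which ssh-keygen would otherwise ask about interactively.

```shell
# make_cluster_key PATH [COMMENT]: create a 4096-bit RSA key pair with no passphrase.
# Refuses to overwrite an existing key file.
make_cluster_key() {
  local path="$1" comment="${2:-neutree-cluster}"
  if [ -e "$path" ]; then
    echo "key already exists: $path" >&2
    return 1
  fi
  ssh-keygen -q -t rsa -b 4096 -N "" -C "$comment" -f "$path"
}
```

For example, make_cluster_key ~/.ssh/neutree_cluster_key "your_email@example.com" creates both files, and ssh-keygen -lf ~/.ssh/neutree_cluster_key.pub prints the fingerprint of the new public key.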

Configuring the public key on target nodes

The ~/.ssh/authorized_keys file on each node stores the public keys allowed to access it. After configuring the public key on all nodes following this section, SSH will automatically use key-based login without prompting for a password.

  1. Copy the public key content to the ~/.ssh/authorized_keys file on the target node, for example with ssh-copy-id:

    Terminal window
    ssh-copy-id -i ~/.ssh/neutree_cluster_key.pub <username>@<node_ip>

    Parameter | Description
    <username> | SSH username. Must be root or a user with root privileges.
    <node_ip> | The IP address of the cluster node.
  2. Set the permissions for the ~/.ssh/authorized_keys file and its parent directory on the target node:

    Terminal window
    ssh <username>@<node_ip> "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"
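When a cluster has several nodes, the two steps above can be repeated in a loop. The following sketch uses an illustrative function name and the KEY and DRY_RUN variables, which are conventions of this example only. It prints the ssh-copy-id commands by default and runs them only when DRY_RUN=0 is set.

```shell
KEY="${KEY:-$HOME/.ssh/neutree_cluster_key}"

# install_key USER IP...: authorize KEY.pub on every listed node.
# Prints the commands unless DRY_RUN=0 is set.
install_key() {
  local user="$1" ip
  shift
  for ip in "$@"; do
    if [ "${DRY_RUN:-1}" = 0 ]; then
      # Copy the public key, then tighten permissions on the node.
      ssh-copy-id -i "$KEY.pub" "$user@$ip" &&
        ssh -i "$KEY" "$user@$ip" 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
    else
      echo "ssh-copy-id -i $KEY.pub $user@$ip"
    fi
  done
}
```

For example, DRY_RUN=0 install_key root 192.0.2.10 192.0.2.11 configures two nodes; the addresses are placeholders from the documentation range. ssh-copy-id still prompts for each node's password once.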

Verifying SSH connectivity

Run the following commands from the machine that holds the private key, once for each node in the cluster, to test SSH connectivity and ensure each node connects successfully.

  1. Test SSH connectivity using the private key:

    Terminal window
    ssh -i ~/.ssh/neutree_cluster_key <username>@<node_ip>

    You should be able to log in without a password, indicating that the SSH key is configured correctly.

  2. For non-root users, test root privileges:

    Terminal window
    ssh -i ~/.ssh/neutree_cluster_key <username>@<node_ip> "sudo whoami"

    The expected output is root, indicating the user has sudo privileges.
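Both checks can be combined into one non-interactive test per node. In this sketch, BatchMode=yes makes ssh fail instead of falling back to a password prompt, and sudo -n makes sudo fail instead of asking for a password. The function name and the KEY variable are illustrative, and the check assumes sudo is installed on the node even when logging in as root.

```shell
KEY="${KEY:-$HOME/.ssh/neutree_cluster_key}"

# check_node USER IP: succeed when key-based login works and sudo resolves to root.
check_node() {
  local user="$1" ip="$2" who
  who=$(ssh -i "$KEY" -o BatchMode=yes -o ConnectTimeout=5 "$user@$ip" 'sudo -n whoami' 2>/dev/null)
  if [ "$who" = root ]; then
    echo "$ip: ok"
  else
    echo "$ip: FAILED"
    return 1
  fi
}
```

A loop such as for ip in 192.0.2.10 192.0.2.11; do check_node root "$ip"; done covers every node; the addresses are placeholders.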

Retrieving the private key content

Before creating the cluster, retrieve the private key content with the following command:

Terminal window
cat ~/.ssh/neutree_cluster_key

The private key content looks similar to:

-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAACFwAAAAdzc2gtcn
...
-----END OPENSSH PRIVATE KEY-----

Warning

  • The private key is sensitive information. Keep it safe and do not share it with others.
  • Ensure the private key file has permissions set to 600 (chmod 600 ~/.ssh/neutree_cluster_key), otherwise the SSH client may refuse to use the key.
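The permission requirement can be enforced before the key is printed. This is a small sketch with illustrative function names; it uses the GNU stat syntax available on the Linux images listed earlier.

```shell
# key_perms_ok FILE: succeed only when FILE has mode 600 (GNU stat syntax).
key_perms_ok() {
  [ "$(stat -c %a "$1")" = 600 ]
}

# show_private_key FILE: print the key only if its permissions are safe.
show_private_key() {
  if key_perms_ok "$1"; then
    cat "$1"
  else
    echo "refusing to print $1: run chmod 600 $1 first" >&2
    return 1
  fi
}
```

Running ssh-keygen -y -f ~/.ssh/neutree_cluster_key is another useful check: it derives the public key from the private key and fails if the file is not a readable private key.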

Creating a cluster

Follow the steps below to create a cluster. If the cluster nodes cannot access Docker Hub or the connection is slow, you can manually import cluster images.

  1. Log in to the Neutree management interface. Click Clusters in the left navigation pane, then click Create on the right.

  2. Fill in the configuration.

    • Basic Information

      Parameter | Description | Editable after creation
      Name | The name of the cluster. | No
      Workspace | The workspace to which the cluster belongs. | No
    • Image Registry

      Select a container registry for the cluster to store cluster-related container images. If no registry has been added to Neutree yet, see the Creating a container registry section. If your environment has no registry at all, see the Set up a temporary container registry section. This field is not editable after creation.

    • Cluster type

      The cluster type. Select Static Nodes. This field is not editable after creation.

    • Version

      The cluster version. The system automatically retrieves available cluster versions from the selected container registry. After creation, the version can be updated via Upgrading the cluster version.

    • Provider

      Parameter | Description | Editable after creation
      Head Node IP | The IP address of the head node. | No
      Worker Node IPs | The IP addresses of the worker nodes. Not required for single-node clusters. For multi-node clusters, enter an IP address and click + Add to add the next one. | Yes
    • Node Authentication

      Parameter | Description | Editable after creation
      SSH User | SSH username. Must be root or a user with root privileges. | No
      SSH Private Key | The SSH private key string. See the Preparing SSH private key section for how to obtain it. | No
    • Model Caches

      Parameter | Description | Editable after creation
      Name | The name of the model cache. | No
      Cache Type | Static node clusters only support Host Path. | No
      Cache Path | The host path for the model cache. | Yes

      If a model cache is not configured during creation, one cannot be added after the cluster is created.

  3. After confirming the configuration is correct, click Save to complete creation.

Manually importing a cluster image

When upgrading the cluster version or when the network environment is restricted, you can manually import the required cluster images into the Neutree container registry.

Procedure

  1. Download version 1.0.1 of the Neutree CLI tool and the cluster offline image for the required accelerator type, according to the server’s CPU architecture.

  2. Use the CLI tool to upload the cluster offline image to the specified registry:

    Terminal window
    ./neutree-cli-<arch> import cluster \
    --package <cluster_package> \
    --mirror-registry <mirror_registry> \
    [--registry-project <registry_project>] \
    --registry-username <registry_username> \
    --registry-password <registry_password>

    Parameter | Description
    <arch> | The server’s CPU architecture: amd64 or aarch64.
    <cluster_package> | The cluster offline image name, in the format neutree-cluster-ssh-v1.0.1-<arch>.tar.gz.
    <mirror_registry> | An OCI-compatible image registry address, without the https:// prefix. It must match the registry address used when uploading images with the CLI tool during Neutree management plane deployment.
    --registry-project <registry_project> | Optional. The registry project name. Ensure the corresponding project has been created in the registry in advance.
    <registry_username> | The username for the registry. The user must have image upload permissions.
    <registry_password> | The login password or access key (such as a token) for the registry user.
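To keep the registry password out of the shell history, the import command can be wrapped so that credentials come from environment variables. The sketch below uses only the flags documented above; the function name, the REGISTRY_USERNAME, REGISTRY_PASSWORD, and DRY_RUN variables, and the registry.example.com address are assumptions made for this example. It strips a leading https:// from the registry address and, by default, prints the command with the password masked instead of running it.

```shell
# import_cluster ARCH PACKAGE REGISTRY [PROJECT]
# Reads credentials from REGISTRY_USERNAME and REGISTRY_PASSWORD.
# Prints the command with the password masked unless DRY_RUN=0 is set.
import_cluster() {
  local arch="$1" package="$2" registry="${3#https://}" project="${4:-}"
  set -- import cluster --package "$package" --mirror-registry "$registry"
  if [ -n "$project" ]; then
    # --registry-project is optional and only passed when a project is given.
    set -- "$@" --registry-project "$project"
  fi
  set -- "$@" --registry-username "$REGISTRY_USERNAME"
  if [ "${DRY_RUN:-1}" = 0 ]; then
    "./neutree-cli-$arch" "$@" --registry-password "$REGISTRY_PASSWORD"
  else
    echo "./neutree-cli-$arch $* --registry-password ****"
  fi
}
```

For example, after exporting the two credential variables, DRY_RUN=0 import_cluster amd64 neutree-cluster-ssh-v1.0.1-amd64.tar.gz registry.example.com neutree uploads the package to the neutree project of the example registry.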

Viewing clusters

Log in to the Neutree management interface. Click Clusters in the left navigation pane. The cluster list on the right shows all current clusters. Click a cluster name to view its details.

On the details page, you can view Basic Information, Monitor, and the Ray Dashboard as needed.

Possible cluster states during operation and their descriptions:

State | Description
Initializing | The cluster is being initialized for the first time.
Running | The cluster is operating normally.
Updating | The cluster configuration has changed and the new configuration is being applied.
Upgrading | The cluster is undergoing a version upgrade.
Failed | The cluster is experiencing an error. Check node status and logs.
Deleting | The cluster is being deleted and resources are being cleaned up.

Editing a cluster

After a cluster is created, you can modify the worker node configuration and model cache path as needed.

  1. Log in to the Neutree management interface. In the cluster list or details page, click the menu icon and select Edit.

  2. Modify the configuration as needed. For parameter descriptions, see Creating a cluster.

  3. After confirming the configuration is correct, click Save to complete the edit.

Deleting clusters

You can delete one or more clusters at the same time.

  1. Log in to the Neutree management interface. In the cluster list or details page, click the menu icon and select Delete; or select multiple clusters in the list and click Delete above the list.

  2. In the confirmation dialog, confirm the deletion and click Delete. The selected clusters will be permanently deleted.