Skip to content
Neutree Documentation

Managing endpoints

It is recommended to use the model catalog to create endpoints. Neutree will automatically populate the relevant configurations from the model catalog. After making any necessary modifications, you can quickly create an endpoint. Neutree also supports creating endpoints without selecting a model catalog by manually entering the configuration parameters.

  1. Log in to the Neutree management interface. In the left navigation pane, click Endpoints, then click Create on the right-side page.

  2. Fill in the configuration information.

    Parameter Description Editable after creation
    Name The name of the endpoint. No
    Workspace The workspace that the endpoint belongs to. No
    Cluster The cluster that the endpoint belongs to. Yes
    Model registry The model registry that the endpoint uses. Yes
    Model catalog (template) Optional.
    • When you select a model catalog for the endpoint, the built-in parameters of that model catalog are populated into the custom configuration fields, which you can modify as needed.
    • If no model catalog is selected, you need to fill in the model configuration manually.
    No
    CPU The number of CPU cores allocated to the model. Yes
    Memory The amount of memory allocated to the model. Yes
    Accelerator The type and model of accelerator allocated to the model. Yes
    Accelerator count The number of accelerators allocated to the model. Static node clusters support logical partitioning, allowing values less than 1. Yes
  3. Click Custom settings and fill in the custom configuration as needed.

    If you specified a model catalog in the basic configuration, the custom configuration fields will be automatically populated with the built-in parameters of that model catalog. You can edit them as needed; modifications here will not affect the model catalog. If no model catalog was specified, fill in the custom configuration as needed.

    • Model settings

      Parameter Description Editable after creation
      Model name The name of the model used by the endpoint. Yes
      Model version

      The version of the model used by the endpoint.

      For multi-version models, you can specify a particular version as needed.

      If left empty, Neutree will automatically use the latest model version: the main version for Hugging Face model registries, and the latest version for file system model registries.

      Yes
      Model file The model file used by the endpoint. Enter the entry file in the model folder.
      • Models of the safetensors type do not require this field.
      • For models of the GGUF type, select the GGUF file of the required quantization version as the entry file. For example, for the 8-bit quantization version, select the file ending in *8_0.gguf as the entry file.
      Yes
    • Engine settings

      Parameter Description Editable after creation
      Engine The inference engine for the endpoint.
      • vllm: A mainstream open-source inference engine with efficient model inference capabilities, suitable for NVIDIA GPU and AMD GPU scenarios.
      • llama-cpp: A lightweight inference engine suitable for CPU-only scenarios. Requires models in GGUF format.
      Yes
      Engine version

      The engine version for the endpoint. Defaults to the latest version; you can select a specific version as needed.

      Yes
      Task type The task type for the endpoint. An endpoint may support multiple task types; select the task type that matches the model you are using.
      • Text generation: The most common LLM inference scenario.
      • Text embedding: Generates vector representations of text, commonly used for text similarity computation.
      • Text reranking: Ranks text based on similarity, commonly used for search result reranking.
      Yes
    • Replica settings

      ParameterDescriptionEditable After Creation
      ReplicasThe replica policy for the endpoint. Multiple replicas enable high availability; each replica consumes one set of resource allocations.Yes
      Scheduler typeWhen using the vLLM inference engine, it is recommended to use the Consistent hashing policy. The scheduler will be aware of the KVCache distribution in the engine and intelligently optimize inference request routing to effectively improve KVCache hit rates.
      For other scenarios, the Round robin policy is recommended to maintain even load distribution.
      Yes
    • Advanced options

      • Engine variables: You can select parameters defined in the engine or enter custom key-value pairs. After the endpoint is created, you can edit existing engine variables and add new key-value pairs. When using the vLLM engine with multiple accelerators configured, the system will automatically set tensor_parallel_size equal to the number of accelerators; no manual configuration is required.
      • Environment variables: You can enter custom key-value pairs. After the endpoint is created, you can edit existing environment variables and add new key-value pairs.
  4. After confirming the configuration is correct, click Save to complete the creation.

Log in to the Neutree management interface. In the left navigation pane, click Endpoints. The endpoint list on the right will display all current endpoints. Click an endpoint name to view its details. On the details page, you can view Basic, Ray Dashboard, Monitor, Logs, and Playground as needed.

On the Basic tab of the details page, the Service URL is the API URL that external services use to call this endpoint. After creating an API key for any OpenAI API-compatible client, the client can use the API key to call this API URL for integration.

Neutree provides a Ray dashboard, monitoring, and logs to help you understand the running status and performance metrics of endpoints, enabling you to identify and resolve issues promptly.

  • Ray Dashboard: The Ray dashboard is only available for static node clusters. Select the Ray Dashboard tab on the details page to view the Ray dashboard information for the endpoint. The Ray dashboard allows you to track application performance and is used for monitoring and debugging Ray applications.

  • Monitor: Select the Monitor tab on the details page to view real-time monitoring information for the endpoint. You can filter monitoring data by time range and set the auto-refresh frequency for monitoring data. For multi-replica endpoints, you can use the replica selector to view monitoring metrics for individual replicas. For endpoints using the vLLM engine, you can also view the monitoring data reported by the vLLM engine.

  • Logs: Select the Logs tab on the details page to view application logs, error output, and standard output of the endpoint. You can quickly filter logs by keyword and time range, with support for log download and auto-refresh.

Neutree provides a Playground where you can quickly verify that an endpoint’s functionality is working correctly and that the model’s output meets expectations, enabling you to adjust the endpoint configuration promptly.

Select the Playground tab on the details page. The platform will display the corresponding test interface based on the endpoint’s task type.

  • For Text generation tasks, a conversation window is displayed for testing the basic functionality of the inference API.

  • For Text embedding tasks, a set of editable text entries is displayed. After clicking Generate, the similarity relationships between the generated text vectors are displayed.

  • For Text reranking tasks, an editable prompt and a set of editable related texts are displayed. After clicking Generate, the reranking results showing the relevance between the texts and the prompt are displayed.

After an endpoint is created, you can modify its resource, model, engine, replica, and advanced options configuration based on actual requirements.

  1. Log in to the Neutree management interface. In the endpoint list or details page, click the menu icon () and select Edit.

  2. Modify the configuration as needed. For detailed parameter descriptions, refer to Creating an endpoint.

  3. After confirming the configuration is correct, click Save to complete the edit.

When an endpoint is temporarily not needed, you can suspend it to release the resources it occupies. After suspending an endpoint, the accelerator resources (such as GPUs), CPU, and memory it occupies will be released, and you can allocate these resources to other endpoints or applications. During suspension, the endpoint’s configuration, model files, and history are fully retained, and you can resume the endpoint at any time to continue providing inference services.

Procedure

  1. Log in to the Neutree management interface. In the endpoint list or details page, click the menu icon () and select Pause. The endpoint will enter the suspended state and its resources will be released.

  2. Verify that the endpoint status has been updated to Paused.

Follow-up

To resume an endpoint, click the menu icon () in the endpoint list or details page and select Resume. The endpoint will be reallocated resources and resume operation.

You can delete one or more endpoints at the same time.

  1. Log in to the Neutree management interface. In the endpoint list or details page, click the menu icon () and select Delete; or in the endpoint list, select multiple endpoints to delete and click Delete above the list.

  2. In the dialog box that appears, confirm the deletion and click Delete. The selected endpoints will be permanently deleted.