Managing Endpoints

Create Endpoint

We recommend creating endpoints from a model catalog: Neutree automatically populates the relevant configuration from the catalog, and you can modify it as needed to create the endpoint quickly. Manual configuration without selecting a model catalog is also supported.

  1. Log in to the Neutree management interface, click Endpoints in the left sidebar, and click Create on the right page.

  2. Fill in the configuration information.

    Parameter descriptions:

    • Name: The name of the endpoint.
    • Workspace: The workspace to which the endpoint belongs.
    • Cluster: The cluster to which the endpoint belongs.
    • Model Registry: The model registry to which the endpoint belongs.
    • Model Catalog (Template): Optional.
      • If a model catalog is selected, its built-in parameters are populated into the custom configuration and can be modified as needed.
      • If no model catalog is selected, fill in the model configuration manually.
    • CPU: The number of CPU cores allocated to the model.
    • Memory: The memory capacity allocated to the model.
    • Accelerator Count: The number of accelerators allocated to the model. Static node clusters support fractional accelerators and can be configured with quantities less than 1.
    • Accelerator Type: The type of accelerator allocated to the model, such as nvidia_gpu or amd_gpu.
    • Accelerator Model: The accelerator model allocated to the model, such as NVIDIA_L20.
  3. Click Custom Settings and fill in custom configuration information as needed.

    If you specified a model catalog in the basic configuration, the custom configuration information will be automatically populated with the parameters built into that model catalog, which you can edit as needed. Modifications here will not affect the model catalog. If no model catalog was specified, fill in the custom configuration as needed.

    • Model Settings

      Parameter descriptions:

      • Model Name: The name of the model used by the endpoint.
      • Model Version: The version of the model used by the endpoint. For multi-version models, you can specify a particular version as needed. If left empty, Neutree automatically uses the latest model version: the main version for Hugging Face model registries and the latest version for file system model registries.
      • Model File: The model file used by the endpoint. Specify the entry file in the model folder.
        • safetensors models do not require this field.
        • For GGUF models, select the GGUF file of the desired quantized version as the entry file. For example, for the 8-bit quantized version, select the file ending in 8_0.gguf.
    • Engine Settings

      Parameter descriptions:

      • Engine: The inference engine for the endpoint.
        • vLLM: A mainstream open-source inference engine with efficient model inference capabilities, suitable for NVIDIA GPU and AMD GPU scenarios.
        • llama-cpp: A lightweight inference engine suitable for CPU-only scenarios; it requires models in GGUF format.
      • Engine Version: The engine version for the endpoint. The latest version is populated by default, but you can select a specific version as needed.
      • Task Type: The task type for the endpoint. An endpoint may support multiple task types; select one that matches the model being used.
        • Text-generation: Text generation, the most common LLM inference scenario.
        • Text-embedding: Generating text embeddings.
        • Text-rerank: Text similarity reranking.
    • Replica Settings

      Parameter descriptions:

      • Replica Count: The replica strategy for the endpoint. Multiple replicas enable high availability, with each replica consuming one full set of the configured resources.
      • Scheduler Type: When using the vLLM inference engine, the Consistent Hashing strategy is recommended: the scheduler is aware of the KV cache distribution in the engine and optimizes inference request routing to improve KV cache hit rates. For other scenarios, the Load Balancing strategy is recommended to keep the load evenly distributed.
    • Advanced Options

      You can select parameters defined in the engine, or enter custom parameter keys and values (see the example after these steps).

  4. After confirming the configuration is correct, click Save to complete the creation.
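
As a rough illustration of the kind of custom key/value pairs you might enter under Advanced Options, the keys below are common vLLM engine arguments. This is a minimal sketch only; the exact parameter names and values that apply depend on the engine and engine version you selected, so check the engine documentation before using them.

```python
# Illustrative only: typical vLLM engine arguments entered as custom
# key/value pairs under Advanced Options. Whether the platform expects
# dashes or underscores, and which parameters a given engine version
# supports, should be verified against the vLLM documentation.
advanced_options = {
    "max-model-len": 8192,           # maximum context length the engine will serve
    "gpu-memory-utilization": 0.90,  # fraction of GPU memory vLLM may reserve
    "tensor-parallel-size": 2,       # number of accelerators to shard the model across
}
```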

View Endpoint

Log in to the Neutree management interface, click Endpoints in the left sidebar, and the endpoint list on the right will display all current endpoints. Click on an endpoint name to view details. On the details page, you can view Basic Information, Ray Dashboard, Monitoring, Logs, and Test Platform as needed.

On the Basic Information tab of the details page, the Service Address is the API URL that external services use to call this endpoint. After you create an API key, any OpenAI-compatible client can use that key to call this API URL for integration.
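
For example, here is a minimal sketch using the official Python openai client. The base URL, API key, and model name are placeholders that you would replace with your endpoint's Service Address, a Neutree API key, and the model name configured for the endpoint (whether a /v1 suffix is needed depends on the form of the Service Address):

```python
# Minimal sketch: calling a Neutree endpoint from an OpenAI-compatible client.
# base_url, api_key, and model are placeholders -- use the Service Address from
# the Basic Information tab, an API key you created, and the endpoint's model name.
from openai import OpenAI

client = OpenAI(
    base_url="https://<service-address>/v1",  # placeholder Service Address
    api_key="<your-api-key>",                 # placeholder API key
)

response = client.chat.completions.create(
    model="<model-name>",  # the model name configured for the endpoint
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```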

Neutree provides Ray Dashboard, monitoring, and logs to help you understand the endpoint’s operational status and performance metrics, enabling timely identification and resolution of issues.

  • Ray Dashboard: Only available for static node clusters. Select the Ray Dashboard tab on the details page to view the endpoint’s Ray Dashboard information. The Ray Dashboard tracks application performance and is used for monitoring and debugging Ray applications.

  • Monitoring: Select the Monitoring tab on the details page to view real-time monitoring information for the endpoint. You can filter monitoring data by time range and set the frequency for automatic monitoring data refresh. For endpoints using the vLLM engine, you can also view monitoring data reported by the vLLM engine.

  • Logs: Select the Logs tab on the details page to view the endpoint’s application logs, error output, and standard output. You can quickly filter logs by keywords and time range, and logs can be downloaded.

Neutree provides a test platform where you can quickly verify that the endpoint is functioning properly and that the model performs as expected, allowing you to adjust the endpoint configuration promptly.

Select the Test Platform tab on the details page, and the platform will display the corresponding test interface based on the endpoint’s task type.

  • For Text-generation tasks, a chat window will be displayed for testing basic functionality of the inference API.

  • For Text-embedding tasks, a set of editable texts will be displayed. After clicking Generate, the similarity relationships between the generated text vectors will be shown (see the API sketch after this list).

  • For Text-rerank tasks, an editable prompt and a set of editable related texts will be displayed. After clicking Generate, the reranked results showing how strongly each text relates to the prompt will be displayed.
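
The Text-embedding test corresponds to calling the endpoint's OpenAI-compatible embeddings API and comparing the returned vectors. A minimal sketch, again assuming placeholder values for the service address, API key, and model name:

```python
# Minimal sketch: generating embeddings from a Text-embedding endpoint and
# comparing them with cosine similarity, similar to what the Test Platform shows.
# base_url, api_key, and model are placeholders for your endpoint's actual values.
import math
from openai import OpenAI

client = OpenAI(
    base_url="https://<service-address>/v1",  # placeholder Service Address
    api_key="<your-api-key>",                 # placeholder API key
)

texts = [
    "The weather is sunny today.",
    "It is bright and clear outside.",
    "Quarterly revenue grew by 12%.",
]
result = client.embeddings.create(model="<embedding-model-name>", input=texts)
vectors = [item.embedding for item in result.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pairwise similarities between the input texts.
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        print(f"similarity(text {i}, text {j}) = {cosine(vectors[i], vectors[j]):.3f}")
```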

Edit Endpoint

After creating an endpoint, you can adjust the endpoint’s resource settings, engine settings, replica settings, and advanced options according to actual needs.

  1. Log in to the Neutree management interface, click the menu icon on the endpoint list or details page, and select Edit.

  2. Modify as needed on the configuration page. For parameter descriptions, refer to Create Endpoint.

  3. After confirming the configuration is correct, click Save to complete the edit.

Pause Endpoint

When an endpoint is temporarily not needed, you can pause it to release the resources it occupies. After pausing an endpoint, its accelerator resources (such as GPUs), CPU, and memory resources will be released, and you can allocate these resources to other endpoints or applications. During the pause, the endpoint’s configuration, model files, and history will be fully preserved, and you can resume the endpoint at any time to continue providing inference services.

Steps

  1. Log in to the Neutree management interface, click the menu icon on the endpoint list or details page, and select Pause. The endpoint will enter a paused state and its resources will be released.

  2. Verify that the endpoint status has been updated to Pausing.

Next Steps

To resume the endpoint, click the menu icon on the endpoint list or details page and select Resume. The endpoint will be reallocated resources and resume operation.

Delete Endpoint

  1. Log in to the Neutree management interface, click the menu icon on the endpoint list or details page, and select Delete.

  2. In the pop-up dialog, confirm and click Delete. The endpoint will be permanently deleted.