Best Practices for High Availability of LLM Services Based on AI Gateway

Chu Gui

Sep 4, 2025


With the widespread deployment of LLM services, their availability and traffic governance face new reliability requirements. Because LLMs have very large parameter counts, deploying and restarting an LLM service takes a long time; if the service crashes under overload, restarting can take several minutes, which significantly impacts availability. Alibaba Cloud AI Gateway provides proxy functionality for LLM services from multiple sources: it not only makes it easy to configure proxying for an LLM service, but also offers rich traffic governance capabilities at the LLM service entry point, improving both observability and availability. For self-deployed LLM services, traditional gateway health detection and overload protection mechanisms often react with a delay. Alibaba Cloud AI Gateway provides a series of high availability mechanisms, such as passive health checks, first packet timeouts, and fallback, that with reasonable configuration enable real-time detection of LLM service overload and timely protection.

1. Problem Scenarios

User traffic is bursty and unpredictable. A sudden traffic surge can significantly impact the availability of an LLM service. For example, when the LLM service processes a large number of requests simultaneously, response times grow excessively long and the user experience degrades. Worse, because the LLM service's GPU memory is limited, the service may crash under heavy concurrency once that memory is fully utilized.

In this example, the model type is DeepSeek-R1-Distill-Qwen-7B and the resource type is ml.gu7i.c8m30.1-gu30, which provides 24 GB of GPU memory.

Checking the deployed LLM service shows that its GPU utilization has already reached 99%.

When user requests surge, the first packet response time increases with the number of concurrent requests, indicating that load pressure on the LLM service is gradually building.
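To make this observable from the client side, here is a minimal load-test sketch that measures first packet latency (time to first token) of an OpenAI-compatible LLM service under increasing concurrency. The endpoint URL and API key are placeholders, not values from this walkthrough:

```python
# Minimal sketch, assuming a hypothetical OpenAI-compatible gateway endpoint.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://<gateway-host>/v1/chat/completions"  # placeholder, not a real host
API_KEY = "<api-key>"                                   # placeholder

def first_packet_latency() -> float:
    """Send one streaming chat request and return the time to the first SSE line."""
    start = time.monotonic()
    with requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "DeepSeek-R1-Distill-Qwen-7B",
            "messages": [{"role": "user", "content": "Hello"}],
            "stream": True,  # streamed response: the first chunk marks the first packet
        },
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:  # skip blank SSE separator lines
                return time.monotonic() - start
    return float("inf")

# Ramp up concurrency and watch the average first packet latency climb.
for concurrency in (1, 8, 32, 64):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: first_packet_latency(), range(concurrency)))
    print(f"concurrency={concurrency:3d}  avg first packet={sum(latencies)/len(latencies):.3f}s")
```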

Without the related LLM high availability capabilities enabled, concurrent user traffic eventually exceeds what the LLM service can handle and the service crashes; the subsequent restart takes around 3 minutes, during which no service can be provided.

2. High Availability Assurance of Alibaba Cloud AI Gateway LLM Service

Fallback Mechanism

The fallback mechanism switches traffic to a backup LLM service when the primary LLM service is unavailable. This is the most fundamental way to achieve high availability of LLM services on the AI Gateway. Consider a common use case among AI Gateway customers: when a self-built LLM model on Alibaba Cloud becomes unavailable, the AI Gateway's fallback mechanism redirects requests to Alibaba Cloud Bailian.

In Alibaba Cloud, create a new gateway instance under AI Gateway. Navigate to the Services tab, click Create Service, and select AI Service as the service source. For the large model supplier, select PAI-EAS; the AI Gateway automatically discovers the PAI-EAS services you have created. After selecting the workspace and the target EAS service, the LLM API key is retrieved automatically from PAI. Click Confirm to create the service.

Then open the LLM API tab, click Create LLM API, choose the service created above from the LLM service list, fill in the basic information, enable fallback, and select Bailian as the fallback backup service. Click Confirm to create the LLM API.

Click Debug in the LLM API's operation menu to quickly start a conversation through the AI Gateway; the response is returned normally from the backend PAI-EAS service.

Then select abort traffic in PAI-EAS to simulate an abnormal condition of the backend service.

Now, when a user sends a dialogue request, because the primary service deployed on PAI-EAS is unavailable, the request automatically falls back to the backup service on Bailian; the response indicates that it was processed by a Qwen model, so service availability is preserved.
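A quick way to verify the fallback from a client is sketched below (endpoint and API key are placeholders): an OpenAI-compatible response carries a `model` field, which reveals whether the primary DeepSeek distill on PAI-EAS or the Qwen fallback on Bailian answered.

```python
# Minimal sketch, assuming a hypothetical gateway endpoint and API key.
import requests

ENDPOINT = "http://<gateway-host>/v1/chat/completions"  # placeholder
API_KEY = "<api-key>"                                   # placeholder

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "DeepSeek-R1-Distill-Qwen-7B",
        "messages": [{"role": "user", "content": "Which model are you?"}],
    },
    timeout=60,
)
resp.raise_for_status()
data = resp.json()

# With the primary PAI-EAS service aborted, the gateway falls back to Bailian,
# and the reported model name is a Qwen model rather than the DeepSeek distill.
print("served by:", data["model"])
print("answer:", data["choices"][0]["message"]["content"])
```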

Passive Health Check and First Packet Timeout

For self-built LLM scenarios, a traffic surge exhausts resources and fills up GPU memory, so relying on the fallback mechanism alone for after-the-fact protection is not enough. In this scenario, the AI Gateway's concurrency and rate limiting protections can be applied, and passive health checks can be combined with first packet timeouts for preemptive protection.

The service's response time reflects the current load on the large model. Specifically, with a first packet timeout configured, a request whose first packet takes too long fails fast and can be retried, preserving the user experience; when the request failure rate climbs too high, the passive health check triggers and promptly ejects the overloaded backend node, protecting it from further load; once all service nodes have been ejected, requests fall back to the backup service to keep the service running.

The following use case demonstrates how Alibaba Cloud AI Gateway ensures the availability of LLM services during user traffic surges.

First, in the service just created, open the health check configuration and enable passive health checks. Set the failure rate threshold to 50, meaning that a node is marked as faulty and ejected once its request failure rate reaches 50%. Set the detection interval to 1s, so that the request failure rate is recalculated every second, and set the base ejection time to 30s, the initial duration for which an ejected node stays isolated.

In the Model API just created, click Edit and, at the bottom of the large model service configuration, set the first packet timeout to 200ms: if the time to first packet exceeds 200ms, the request times out and fails.
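To illustrate what a 200ms first packet timeout means in practice, here is a client-side sketch; this is not the gateway's implementation, and the endpoint, key, and helper name are hypothetical. The `requests` read timeout bounds the wait for the next bytes of a streaming response, which approximates a first-packet deadline:

```python
# Client-side sketch of a first-packet deadline, assuming a hypothetical endpoint.
import requests

def chat_with_first_packet_deadline(endpoint: str, api_key: str,
                                    deadline_s: float = 0.2) -> bytes:
    """Return the first SSE line, or raise if it arrives slower than the deadline."""
    try:
        with requests.post(
            endpoint,
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "DeepSeek-R1-Distill-Qwen-7B",
                "messages": [{"role": "user", "content": "Hello"}],
                "stream": True,
            },
            stream=True,
            # (connect timeout, read timeout): the read timeout approximates
            # the gateway's first-packet deadline for a streaming response.
            timeout=(3.0, deadline_s),
        ) as resp:
            resp.raise_for_status()
            return next(line for line in resp.iter_lines() if line)
    except requests.exceptions.ReadTimeout:
        # Fail fast, as the gateway does; repeated fast failures are what
        # feed the passive health check described above.
        raise TimeoutError(f"no first packet within {deadline_s}s") from None
```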

The specific configuration and meanings are shown in the table below.

| Configuration Name | Configuration Value | Field Meaning |
| --- | --- | --- |
| Failure Rate Threshold | 50 | If the proportion of failed requests for a node reaches this threshold, the system triggers the ejection mechanism for that node. |
| Detection Interval | 1s | The request failure rate is recalculated at this interval (here, every 1 second). |
| Base Ejection Time | 30s | The initial isolation duration for a node after being ejected. The isolation duration is k * base_ejection_time (k starts at 1): each ejection extends the isolation time (k increases by one), while consecutive healthy checks gradually shorten it (k decreases by one). |
| First Packet Timeout | 200ms | If the response time of the first data packet exceeds this value, the request times out and fails. |
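The base ejection time formula from the table can be illustrated with a toy Python sketch; the `max(1, ...)` floor is an assumption about how k decreases, not documented behavior:

```python
# Toy illustration of isolation = k * base_ejection_time from the table above.
BASE_EJECTION_TIME = 30  # seconds, as configured

k = 1  # multiplier: starts at 1, grows on each ejection, shrinks on healthy checks

def on_ejection() -> int:
    """Eject the node and return how long it stays isolated this time."""
    global k
    isolation = k * BASE_EJECTION_TIME
    k += 1
    return isolation

def on_healthy_check() -> None:
    """A consecutive healthy check shortens future isolation."""
    global k
    k = max(1, k - 1)  # assumption: k never drops below its initial value of 1

print([on_ejection() for _ in range(3)])  # [30, 60, 90]: repeated ejections isolate longer
on_healthy_check()                        # one healthy interval: k goes from 4 back to 3
print(on_ejection())                      # 90 instead of 120: isolation shortened
```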

With passive health checks and the first packet timeout enabled, a traffic surge plays out as follows: the overloaded LLM's first packet time grows too long, many user requests fail, and the failure rate exceeds the passive health check threshold, so the overloaded primary node (PAI-EAS) is ejected and all requests are forwarded to the backup service (Tongyi Qianwen). After a period, the primary service recovers and rejoins to serve traffic. Throughout this process, the first packet response time climbs to a peak as the primary service becomes overloaded and is marked unhealthy, then falls and stabilizes once the primary service resumes normal operation.

3. Comparison with LLM High Availability Capabilities of Other AI Gateways

| High Availability Mechanism | Higress (Commercial Version) | Kong (Commercial Version) | Envoy AI Gateway | LiteLLM | Remarks |
| --- | --- | --- | --- | --- | --- |
| First Packet Timeout for SSE Responses | | | | | Higress supports first packet timeouts for streaming responses; compared to LiteLLM, it has the advantage of triggering passive health checks from first packet timeouts for overload protection. |
| Fallback | | | | | Higress supports fallback based on first packet timeouts, error response codes, health check failures, etc., and is more flexible than other gateways' mechanisms. |
| Health Monitoring | Active: ✅ Passive: ✅ | Active: ✅ Passive: ✅ | Active: ❌ Passive: ❌ | Active: ✅ Passive: ❌ | Higress supports both active and passive health checks; when both are enabled, a node must pass both, providing a stronger checking mechanism. |
| Traffic Limiting | | | | | Higress supports more fine-grained rate limiting based on consumer/model name. |
| Out-of-the-box Monitoring and Alerts | | | | | Higress supports out-of-the-box monitoring and alerts, while other gateways expose metrics through Prometheus and require custom alert configuration. |

Contact

Follow and engage with us through the following channels to stay updated on the latest developments from higress.ai.
