Dify hitting performance bottlenecks? Higress AI Gateway injects the "soul of high availability"!
Zhao Yuanxiao, Wang Haoting | Aug 19, 2025
Dify, as an open-source AI application development platform, has gained widespread attention in community and cloud deployments due to its flexible workflow orchestration and user-friendly interface. On GitHub, Dify sees an average of around 20 issues opened daily, maintaining a high level of activity in the open-source community.
However, as the user base of Dify expands and production implementations increase, performance issues within Dify applications have gradually surfaced, affecting user experience and the stability of production environments.
This article introduces how to ensure end-to-end high availability for Dify applications with the Higress AI gateway, and provides a hands-on operations guide.
Performance Issues of Dify Applications
For a Dify AI application serving external traffic, the core dependencies for normal operation are the Dify system components, model services, MCP services, vector databases, and memory stores. The parts most prone to performance bottlenecks are the Dify system components and the model services.
Dify System Components
In high-concurrency scenarios, the Dify system components easily hit CPU limits. In one internal production deployment, with every Dify server component running as a single 4C8G instance, a load test on a Dify workflow application with only 10 nodes saturated the CPU at roughly 10 QPS, leaving both the Dify application and the Dify management console unavailable.
The performance issues of Dify system components stem primarily from the architecture and logic of Dify itself, with the main influencing factors including:
1. Workflow engine implementation. All Dify applications built on a single Dify deployment share the same workflow engine. Besides chaining together and executing the user-defined steps, the engine also manages state transitions, performs frequent data reads and writes, and generates and persists observability data. These additional yet necessary execution steps consume considerable compute, so even a simple workflow incurs noticeable resource overhead.
2. Long execution chain. Take invoking a large model from a Dify application as an example: when the workflow engine running in the API component needs to call a model, it first invokes the daemon-plugin component, which performs the actual model call and returns the result to the API component. Because the current Dify version adopts a plugin-based architecture, most nodes and tools exist as plugins, so a single workflow triggers frequent inter-component calls.
3. Python language implementation. The core server component of the Dify system, the API service, is implemented in Python, which generally delivers lower raw performance than languages such as C++, Golang, or Rust.
From the analysis above, improving the performance of the Dify system components requires continuous optimization at the source-code level, and significant gains are unlikely in the short to medium term.
Model Services
For self-built model services, model inference consumes large amounts of GPU compute and memory. When Dify applications depend on self-built models under high concurrency, exhausted model-service resources can multiply response times or hang requests entirely, severely degrading user experience and potentially dragging down the Dify system itself.
Given that GPUs remain hard to obtain in the domestic market, scaling out GPU resources to support higher concurrency and throughput can carry significant cost, and its cost-effectiveness needs careful assessment.
Given that the performance of the Dify system components will not improve significantly in the short term and that expanding GPU resources is often not an option, high-availability governance for Dify applications becomes particularly important in production scenarios facing bursty or highly concurrent traffic.
Introduction to the AI Gateway
The Higress AI gateway serves as a bridge between external entities and enterprise AI applications, as well as between enterprise AI applications and large language model services and MCP services, aiming to tackle challenges such as complex model integration, security and compliance difficulties, and low management efficiency, providing a unified governance entry point. Its core features include:
Protocol Standardization: Unifies diverse model APIs into an OpenAI-compatible format
Observability System: Provides Token-level monitoring (QPS/success rate/latency) and end-to-end request tracking
Security Protection Layer: Implements automatic rotation of API-KEYs, JWT authentication, and real-time interception of sensitive content
Stability Engine: Integrates multi-level Fallback, AI caching, Token rate-limiting, and other governance capabilities
The AI gateway enhances the efficiency, safety, and stability of enterprises in utilizing AI technology through simplified integration, unified governance, strengthened security, and accelerated responses.
High Availability Capabilities of AI Gateway
The AI gateway provides a comprehensive set of high availability assurance capabilities designed specifically for AI applications and models, ensuring stability and reliability of applications and model services:
Multi-dimensional Request Rate Limiting: Supports refined control of request volumes across services such as applications and models, based on multiple time scales (seconds, minutes, hours), effectively preventing sudden and high-concurrency traffic from overwhelming applications and model services, thereby ensuring system stability.
Token-Level Resource Flow Control: Besides request volume, it provides flow control capabilities based on Token consumption, allowing for more precise management of large model resource usage, avoiding depletion of resource pools due to individual "large requests" affecting other users, achieving a fairer resource allocation.
Model Fallback: When the primary model service experiences faults or abnormal responses, the gateway can automatically and transparently switch requests to a pre-set backup model service, ensuring that core business operations are uninterrupted and achieving second-level disaster recovery.
Model Load Balancing: For self-built model clusters, it offers multiple intelligent load balancing strategies, including GPU awareness and prefix matching, significantly enhancing system throughput and reducing response latency without adding hardware costs, maximizing GPU resource utilization.
AI Cache: Caches high-frequency, repetitive request results and returns responses directly from the gateway, significantly reducing the frequency of calls to the underlying large model, thereby improving response speed and effectively saving model invocation costs.
AI Gateway Proxies Dify Application Traffic
To enhance the high availability of Dify applications using the Higress AI gateway, it is necessary to integrate the AI gateway with the Dify system. The integration plan we provide is depicted in the diagram below. In the original architecture, Dify's built-in Nginx acts as a reverse proxy for incoming traffic, where Dify directly calls large models, RAG services, MCP servers, etc. In the new architecture, the AI gateway replaces Dify's built-in Nginx and serves as a proxy for incoming and outgoing traffic of Dify applications.

For the incoming traffic proxy, we recommend replacing Nginx with the AI gateway rather than chaining the AI gateway in front of Nginx, for the following reasons:
1. Full Capability Coverage: The AI gateway fully covers Nginx's proxy capabilities while additionally providing over 20 AI-specific governance strategies. Nginx's default buffering breaks SSE streaming unless complex parameters are tuned manually, and it lacks deep observability support.
2. Streamlined Architecture: Incoming traffic connects directly from the AI gateway to Dify services, eliminating redundant gateway layers. A dual-gateway architecture (AI gateway → Nginx → Dify) not only adds extra network hops, leading to performance degradation, but also complicates problem diagnosis by requiring an additional layer of Nginx error checking, reducing fault localization efficiency.
3. Operational Cost Optimization: Nginx instances need to be deployed independently, consume extra compute and memory, and require manual maintenance for scaling. Routing configuration changes must be synchronized across two systems, significantly increasing the risk of configuration inconsistency. In contrast, the AI gateway provides enterprise-level SLA guarantees with natively integrated monitoring and alerting, leading to lower maintenance costs.
Operational Guidelines and Capability Impact
AI Gateway Traffic Proxy Configuration
Dify Application Incoming Traffic Proxy
The AI gateway supports creating Agent APIs to proxy access to AI applications, and provides observability, security, high-availability governance, and other capabilities for that traffic. This article focuses on the high-availability governance capabilities.
Creating Service Sources
Create a service source for the Dify API component in the AI gateway. If Dify is deployed on SAE or ACK, you can create the service as follows.
SAE (Deploy Dify through SAE or Computing Nest Dify Community Edition - Serverless)
Under Service > Services > Create Service, add the application dify-api-{namespace} from the SAE namespace where Dify is deployed.


ACK (Deploy Dify Community Edition - High Availability Version via ACK Helm or Computing Nest)
Under Service > Sources > Create Source, create the container service source; then, under Service > Services > Create Service, add ack-dify-api from the dify-system namespace of the corresponding container cluster.


Configuring Routing
Next, configure routing for Dify services in the AI gateway using the Agent API method. The steps are as follows:
Create the Agent API. Click Agent API > Create Agent API. The domain name and base path can be set according to your environment so that Dify is reachable via a domain name without path conflicts with other services; check the option to strip the base path when forwarding to the backend service, and select the Dify protocol.

Create routes. Open the Agent API created in the previous step and click Create Route. If your Dify hosts a workflow application, select the path /v1/workflows/run; if it hosts an agent application, select the path /v1/chat-messages. Then choose the backend service created earlier.
Note that if you want each route to map to a single Dify application, you can add request header and parameter matching rules under More Matching Rules, for example a header match such as app-id. Make sure requests sent to that Dify application through the AI gateway carry the matching header or parameter (see the validation sketch below).


Validate application access. Call the Dify application using the domain name and path configured above; if the call succeeds, the incoming traffic proxy is configured correctly.
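For reference, here is a minimal validation sketch using Python and the requests library. The gateway domain, base path, app-id header value, workflow inputs, and the Dify application API key are all placeholders for your own setup.

```python
import requests

# Placeholders: replace with your gateway domain, Agent API base path,
# Dify application API key, and the header value configured in the route match.
GATEWAY_URL = "https://dify.example.com/agent/v1/workflows/run"
DIFY_APP_KEY = "app-xxxxxxxxxxxx"

resp = requests.post(
    GATEWAY_URL,
    headers={
        "Authorization": f"Bearer {DIFY_APP_KEY}",
        "app-id": "demo-workflow",  # matches the header rule configured on the route
        "Content-Type": "application/json",
    },
    json={
        "inputs": {"query": "hello"},  # inputs depend on your workflow definition
        "response_mode": "blocking",
        "user": "smoke-test",
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json())
```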

Dify Application Model Outgoing Traffic Proxy
The AI gateway supports creating LLM APIs to proxy access to large models, creating MCP servers to proxy MCP traffic, and using RAG plugins to proxy RAG retrieval, providing rich observability, security, and high-availability governance for each type of traffic.
Since accessing large models is the core scenario of AI applications, this section will primarily introduce the operational methods for the AI gateway proxying large model traffic, along with corresponding high availability governance capabilities.
1. Create an LLM API in the AI gateway console for accessing self-built or third-party large models. For details, refer to Managing LLM API.
2. Go to the Dify plugin marketplace and install the OpenAI-API-compatible plugin.

3. In the Dify console, click Settings in the upper-right corner, open Model Provider, and add a model under OpenAI-API-compatible. Taking an LLM model as an example, the model type, model name, and API endpoint URL are required; the API endpoint URL is the domain name plus the base path of the LLM API created in step 1. Configure other parameters as needed, then click Save.

4. In the application, select the nodes that need a model and choose the model added in step 3.


5. Run a workflow or agent in Dify and verify that calls to the LLM through the AI gateway return results normally, which indicates the outgoing model traffic proxy is working. A minimal verification sketch follows below.
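You can also call the gateway's OpenAI-compatible endpoint directly to confirm the LLM API before (or while) testing in Dify. Below is a minimal sketch using the openai Python SDK; the base URL, API key, and model name are placeholders for your own environment.

```python
from openai import OpenAI

# Placeholders: the gateway LLM API base URL and credentials from step 1.
client = OpenAI(
    base_url="https://llm.example.com/v1",  # domain + base path of the LLM API
    api_key="sk-your-gateway-or-provider-key",
)

resp = client.chat.completions.create(
    model="qwen-turbo",  # any model name exposed by the LLM API
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
)
print(resp.choices[0].message.content)
```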

Dify Application Incoming Traffic Governance
With the AI gateway as the entry point for Dify application traffic, cluster rate limiting can be configured at the gateway, both globally and per application. This section takes application-level rate limiting as an example.
Global and application-level rate limiting require a Redis instance for counting. Since Redis is already a required storage component of the Dify system, the Dify deployment's Redis can be reused.
After registering the Redis service in the AI gateway, enable the key-based cluster rate-limiting plugin from the plugin market. By configuring the plugin rules, you can apply rate-limiting policies to different Dify applications. For detailed usage, see Key-based Cluster Rate Limiting.
For example, suppose we configure a Dify application to allow only one request per minute.

When this application is invoked a second time within one minute, the AI gateway's rate-limiting rule is triggered and the call fails. Other applications have no rate-limiting rules, so their requests remain unrestricted (see the sketch below).
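To illustrate the expected behavior, the sketch below sends two back-to-back requests to the rate-limited application; with a 1 request/minute rule keyed on the app-id header, the second call should be rejected by the gateway (typically with HTTP 429, depending on the plugin configuration). All endpoint values and keys are placeholders.

```python
import requests

GATEWAY_URL = "https://dify.example.com/agent/v1/workflows/run"
HEADERS = {
    "Authorization": "Bearer app-xxxxxxxxxxxx",
    "app-id": "demo-workflow",  # the key the rate-limit rule counts on
}
BODY = {"inputs": {"query": "hello"}, "response_mode": "blocking", "user": "limit-test"}

# Two back-to-back calls: with a 1 request/minute rule on this app-id,
# the second call is expected to be rejected by the gateway.
for i in range(2):
    resp = requests.post(GATEWAY_URL, headers=HEADERS, json=BODY, timeout=60)
    print(f"request {i + 1}: status={resp.status_code}")
```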

Dify Application Outgoing Traffic Governance
Request and Token Rate Limiting
When the AI gateway proxies model invocations from Dify applications, request rate limiting can be applied on different time scales, similar to the incoming traffic case described above; only the place where the limit is enforced differs, so it is not repeated here.
In addition to request rate limiting, traffic control based on token consumption can be applied when Dify applications call model services. The configuration is shown below; for more detailed usage, see Rate Limiting.

Here, a global limit of 500 tokens/min is configured. Once it takes effect, if the model service as a whole consumes more than 500 tokens within one minute, further requests to the model are rejected directly by the AI gateway. A client-side sketch of this behavior follows.
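The sketch below shows one way to observe the token budget from the client side, assuming the gateway returns an error status once the 500 tokens/min limit is exceeded; the endpoint, key, and model name are placeholders.

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="https://llm.example.com/v1", api_key="sk-your-key")

consumed = 0
# Keep calling until the 500 tokens/min budget is exhausted; the gateway is
# then expected to reject further calls (typically with an HTTP 429 error).
for i in range(10):
    try:
        resp = client.chat.completions.create(
            model="qwen-turbo",
            messages=[{"role": "user", "content": "Write two sentences about gateways."}],
        )
        consumed += resp.usage.total_tokens
        print(f"call {i + 1}: total tokens so far = {consumed}")
    except openai.APIStatusError as e:
        print(f"call {i + 1}: rejected by the gateway, status={e.status_code}")
        break
```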
Fallback
Configuring fallback for model access ensures that when the default model service responds abnormally, requests automatically fall back to a backup model service, keeping model invocation highly available and improving the overall availability of Dify applications. For detailed configuration and usage, refer to AI Fallback.
To demonstrate the fallback effect, this example configures an unreachable primary model service for the Model API used by the Dify application, plus a normally reachable Bailian model service as the backup.

Running the Dify application, the LLM node still returns results normally and the workflow completes smoothly.

The gateway logs show that the AI gateway received a 503 error when calling the primary model service and automatically fell back to the Bailian qwen-turbo model, which returned a normal result and kept the Dify application running. A small client-side verification sketch follows.
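As a quick check outside of Dify, you can call the Model API directly and inspect the response, under the assumption that the response's model field reflects the backend that actually served the request (this depends on the provider and gateway configuration); all names are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://llm.example.com/v1", api_key="sk-your-key")

resp = client.chat.completions.create(
    model="primary-model",  # the (intentionally unreachable) primary model on the gateway
    messages=[{"role": "user", "content": "ping"}],
)

# If fallback kicked in, the call still succeeds; depending on the provider,
# the model field in the response may reveal the backup model that served it.
print(resp.model)
print(resp.choices[0].message.content)
```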


Load Balancing
For scenarios where Dify accesses self-built models (for example, models deployed on Alibaba Cloud PAI), the AI gateway provides load balancing strategies tailored for LLM services, including global least-request load balancing, prefix-matching load balancing, and GPU-aware load balancing. These significantly improve throughput and reduce response latency without adding hardware cost, while achieving fair and efficient task scheduling. For details, refer to AI Load Balancing and "No Increase in GPU, First Token Latency Reduced by 50% | New Practices in LLM Service Load Balancing".
Taking prefix-matching load balancing as an example, we used the performance testing tool NVIDIA GenAI-Perf with an average input of 200 tokens and an average output of 800 tokens per round, a concurrency of 20, 5 rounds of dialogue per session, and 60 sessions in total. The before-and-after metrics are compared below: with a suitable load balancing strategy, time to first token drops by roughly 50%.
| Metric | No Load Balancing | Prefix-Matching Load Balancing |
|---|---|---|
| TTFT (time to first token) | 240 ms | 120 ms |
| Average RT | 14934.85 ms | 14402.36 ms |
| P99 RT | 35345.65 ms | 30215.01 ms |
| Token throughput | 367.48 token/s | 418.96 token/s |
| Prefix cache hit rate | 40%+ | 80%+ |
Prefix-matching load balancing for the Model APIs accessed by Dify applications can be configured easily through a plugin, as illustrated in the configuration example below.

Once the plugin is enabled, calls from Dify applications to LLM services through the AI gateway are load balanced across the self-built LLM service instances.
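To make the mechanism concrete, the purely illustrative sketch below runs a multi-turn conversation against the gateway: each turn resends the full history, so all requests in a session share a growing common prefix, which is exactly what prefix-matching load balancing exploits to keep a session on the same backend instance and reuse its KV cache. The endpoint, key, and model name are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="https://llm.example.com/v1", api_key="sk-your-key")

# Every request repeats the full history, so all requests of one session share
# a growing common prefix that prefix-matching load balancing can route on.
messages = [{"role": "system", "content": "You are a concise assistant."}]
for turn in ["What is an AI gateway?", "How does it help Dify?", "Summarize in one line."]:
    messages.append({"role": "user", "content": turn})
    resp = client.chat.completions.create(model="qwen-turbo", messages=messages)
    answer = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"> {turn}\n{answer}\n")
```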
Conclusion
With the Higress AI gateway behind it, Dify is no longer a "lone wolf" open-source platform; it gains a powerful enterprise-grade governance and enhancement engine. Whether for high-availability protection at the application layer or stability and performance optimization at the model layer, the AI gateway provides out-of-the-box solutions, freeing developers from complex operations so they can focus on business innovation.
In addition to high availability, the Higress AI gateway also offers security, RAG proxying, and other capabilities that strengthen the security of Dify applications and enable seamless integration with external RAG engines. If you use Dify and struggle with high availability, security, or RAG integration, proxying Dify's incoming and outgoing traffic with the Higress AI gateway is a best practice you should not miss.