Frontline practice of implementing a new paradigm of AI application architecture based on MCP

Ji Yuan

Apr 9, 2025

Editor’s note: The more intelligent the application, the more complex the design behind it. The essence of software is to solve problems of complexity. While MCP raises the ceiling of what intelligent applications can create, it also brings enormous complexity to backend design. This article shares some of our practical insights, starting from the technical principles of MCP, on reducing the complexity of building MCP Servers and improving the stability of running them. The article is lengthy; here’s an outline of what to expect.

Get the full 78-page PPT: https://developer.aliyun.com/ebook/8442

  • Introduce the concept of MCP and its operational mechanism.

  • Explain the difference between MCP and Function Calling.

  • Discuss the essence and challenges of MCP, including the challenges of system prompts that describe MCP information, the collaborative relationship between MCP Client and MCP Server, rapidly building MCP Servers, and the pain points of building Dify.

  • Analyze how to solve the various challenges of MCP, including unified management of the MCP Register, MCP Servers, and Prompts; the MCP effect verification system and security assurance; the MCP gateway; dynamic service discovery for MCP Servers; Streamable HTTP; elastic efficiency; observability; etc.

  • Finally, discuss the impact of MCP on a new paradigm in AI application architecture and introduce the concept of MCP Server First.



Current Status and Architecture of AI Agents

Artificial intelligence (AI) is increasingly becoming a core force driving innovation and efficiency in business. At its core is the collaboration of multiple AI Agents, which, through division of labor and cooperation, jointly fulfill the business needs carried by AI applications. This collaborative model not only optimizes enterprise operations but also demonstrates AI’s potential for solving high-impact challenges.

Today, AI Agents interact with everything over HTTP, whether that is various tools (business service interfaces), various memories (storage service interfaces), or various LLMs (large language models). LLMs are the one exception, since most of them follow the OpenAI API convention; for tools and memories, every return format has to be understood, parsed, and adapted one by one. When an AI application contains multiple AI Agents, or needs to interact with many business and storage service interfaces, the overall development workload is substantial, mainly in three areas:

  • Finding suitable business interfaces and storage service interfaces for the AI application:

    • Finding third-party service interfaces.

    • Finding suitable internal service interfaces within the company.

    • If not found, developing interfaces by oneself.

  • Parsing interface return formats: Whether third-party service interfaces or internal company service interfaces, return formats can vary greatly, requiring each one to be understood and parsed.

  • Orchestrating multiple AI Agents:

    • With tools like Dify that assist in visualizing processes, many orchestration tasks are alleviated; however, complexity remains high, and there are still bottlenecks in terms of operational efficiency and performance.

    • Orchestrating through coding approaches (e.g., using Spring AI Alibaba or LangChain) may offer better performance but introduces greater complexity, limiting orchestration efficiency and flexibility.

Thus, many AI applications currently only have a few AI Agents, and many have just one AI Agent, which explains why AI Agents remain in the first stage (Siloed, Single-Purpose Agents).

To help AI Agents reach the second stage (Platform-Level Agents), we use a cloud-native API gateway as a unified access layer. By employing a gateway in three different roles, we solve part of the complexity:

  • As a north-south traffic gateway, it manages the incoming traffic to AI Agents, primarily performing forwarding, load balancing, authentication, authorization, and safety control.

  • As an AI gateway, it proxies various LLMs, shielding AI Agents from complex access requirements and solving many production-grade issues such as multi-model switching, model fallback, multi-API Key management, safety, and online searching.

  • As an east-west gateway, it manages various services from different sources (ACK, ECS, Function Compute FC, SAE, third-party services) for AI Agents to utilize.

However, as mentioned, this only addresses part of the complexity issue; the core issues of finding interfaces and parsing interfaces remain unsolved. It was not until the emergence of MCP (Model Context Protocol) that we glimpsed a real pathway to reaching the second stage (Platform-Level Agents), and potentially even approaching the third stage (Universal Agents, Multi-Agents).

What is MCP

MCP stands for Model Context Protocol, an open-source protocol developed by Anthropic (the Claude development company) designed to allow large language models (LLMs) to connect to external data sources and tools in a standardized manner. It operates like a universal interface for AI applications, helping developers build more flexible, context-aware AI applications without the need for custom integration for each AI model and external system. MCP is designed as a universal interface, similar to a USB-C port, allowing LLM applications to consistently connect to a variety of data sources and tools such as files, databases, APIs, etc.

MCP currently has three core concepts:

  • MCP Server:

    • A program or service developed based on various language MCP SDKs.

    • Or an existing program or service transformed into an MCP Server via some **mysterious mechanism** (revealed later in this article).

  • MCP Tool:

    • MCP Tools belong to the MCP Server, where one MCP Server can have multiple MCP Tools. This can be understood as a class with multiple methods or similar to a service with multiple interfaces.

  • MCP Client: When a piece of code, an Agent, or a client uses and calls the MCP Tool in the MCP Server based on the MCP specification, it acts as an MCP Client.
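To make these three concepts concrete, here is a minimal sketch of an MCP Server exposing two MCP Tools, written against the official Python MCP SDK's FastMCP helper (import paths and decorator details may vary slightly between SDK versions). It mirrors the time example used in the next section.

```python
# Minimal MCP Server sketch based on the official Python MCP SDK (FastMCP).
from datetime import datetime
from zoneinfo import ZoneInfo

from mcp.server.fastmcp import FastMCP

# One MCP Server ...
mcp = FastMCP("time")

# ... owning two MCP Tools, roughly "one service with multiple interfaces".
@mcp.tool()
def get_current_timezone() -> str:
    """Return the name of the server's local timezone."""
    return str(datetime.now().astimezone().tzname())

@mcp.tool()
def get_current_time(timezone: str = "Asia/Shanghai") -> str:
    """Return the current time in the given IANA timezone, ISO 8601 formatted."""
    return datetime.now(ZoneInfo(timezone)).isoformat()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; other transports (e.g. SSE) can be selected
```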

How MCP Works

To truly understand what MCP is, we need to grasp its operational mechanism, allowing us to see how MCP calling differs from traditional HTTP calling and perhaps gain a vague sense of why I've said MCP can advance AI Agents to the second stage.

As shown in the above figure, there are six core steps in a call based on MCP. Let’s first establish a premise:

  • I want to develop an AI Agent to get the time. When users interact with this AI Agent, they simply need to ask questions like, "What time is it now?"

  • I already have an MCP Server that handles time, which contains two MCP Tools: one for fetching the current timezone and one for getting the current time.

Call Step Analysis:

  • Step 1: The user asks the AI Agent, "What time is it now?" At this point, the AI Agent acts as the MCP Client and sends the user’s question along with the MCP Server and MCP Tool information to the LLM.

  • Step 2: Once the LLM receives the information, it begins reasoning by selecting the most appropriate MCP Server and MCP Tool to solve the user’s question based on the MCP Server info and user context, then returns it to the AI Agent (MCP Client).

    • The LLM returns the message to the AI Agent: "Use the get_current_time MCP Tool from the time MCP Server; it can resolve the user’s question."

  • Step 3: The AI Agent (MCP Client) now knows which MCP Server and MCP Tool to use; it directly calls that MCP Tool to obtain results.

    • Calling the get_current_time MCP Tool from the time MCP Server.

  • Step 4: The Time MCP Server returns the result (the current time) to the AI Agent (MCP Client).

  • Step 5: The AI Agent (MCP Client) does not process the raw result itself; it sends the user’s question together with the result obtained from the Time MCP Server back to the LLM, letting the LLM organize a polished answer based on both the question and the answer.

  • Step 6: The LLM returns the neatly organized content to the AI Agent (MCP Client), and finally, the AI Agent (MCP Client) relays it unchanged to the user.
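For readers who prefer code to prose, the six steps can be sketched roughly as below. The LLM and MCP calls are stubbed with canned values so that only the control flow is visible; none of the helper names belong to any real SDK.

```python
# Illustrative sketch of the six-step MCP call flow; all helpers are stand-ins.

MCP_INFO = 'MCP Server "time": tools get_current_timezone(...), get_current_time(...)'

def ask_llm(system_prompt: str, user_message: str) -> str:
    # Stand-in for a real chat-completion request (normally sent via the AI gateway).
    if "tool result:" not in user_message:
        return 'use tool "get_current_time" on MCP Server "time"'  # Step 2: selection
    return "It is 10:30 on April 9, 2025."                         # Step 6: final answer

def call_mcp_tool(server: str, tool: str) -> str:
    # Stand-in for Steps 3/4: the MCP Client invoking the chosen MCP Tool.
    return "2025-04-09T10:30:00+08:00"

def agent(question: str) -> str:
    # Step 1: send the user question plus MCP Server/Tool descriptions to the LLM.
    decision = ask_llm(MCP_INFO, question)
    # Steps 3/4: call the selected MCP Tool (hard-coded here; a real client parses `decision`).
    raw_result = call_mcp_tool("time", "get_current_time")
    # Steps 5/6: hand the question and the raw result back to the LLM to organize.
    return ask_llm(MCP_INFO, f"{question}\ntool result: {raw_result}")

print(agent("What time is it now?"))  # -> "It is 10:30 on April 9, 2025."
```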

Throughout the MCP calling process, one very key element is the **information of the MCP Server and MCP Tool.** As seen in Steps 1 and 2, this information tells the LLM how to address the user’s inquiry; it constitutes the most crucial System Prompt in MCP, so at its heart MCP is an exercise in prompt engineering (PE).

MCP System Prompt

Unlike traditional protocol definitions, MCP does not have a fixed data structure. Its core lies in clearly describing which MCP Servers and MCP Tools exist and their respective functions through natural language, enabling the large language model to select the most suitable MCP Server and Tool. Therefore, its essence fundamentally revolves around prompt engineering.

The two images above demonstrate the System Prompt within Cline (an MCP Client), clearly detailing the functions and roles of MCP Servers and Tools.

The image above illustrates the first step of sending the user’s question along with the System Prompt to the LLM.

The image above depicts the second step, where the LLM returns clear MCP Server and Tool information to solve the user’s query.
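The screenshots are not reproduced here, so the fragment below is an illustrative approximation (not Cline's verbatim prompt) of how an MCP Client typically describes the connected MCP Servers and their Tools to the LLM inside the system prompt:

```text
# Connected MCP Servers (illustrative)
You can use tools provided by connected MCP Servers by issuing a use_mcp_tool
request with the server name, the tool name, and JSON arguments.

## time
### Available Tools
- get_current_timezone: Returns the server's local timezone. Input schema: {}
- get_current_time: Returns the current time in a given timezone.
  Input schema: {"type": "object", "properties": {"timezone": {"type": "string"}}}
```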

Differences Between MCP and Function Calling

Having reached this point, I believe you have a certain understanding of what MCP is. Has MCP resolved the issues of finding interfaces and parsing interfaces? Yes, because both tasks are handed over to the LLM:

  • The LLM helps the AI Agent find the most appropriate interface.

  • The AI Agent calls the interface directly without needing to parse the return results; it simply returns them unchanged to the LLM.

  • The LLM combines the user’s questions and the results returned from the interfaces to process and organize content.

Some may ask, what’s the difference between MCP and Function Calling in LLMs? The core difference lies in whether they are bound to a specific model or vendor:

  • MCP is a standard for the general protocol layer, akin to the "USB-C interface of AI", defining the communication format between LLMs and external tools/data sources but does not tie to any specific model or vendor. It abstracts complex function calls into a client-server architecture.

  • Function Calling is a proprietary capability provided by large model vendors; it is defined by each vendor and differs between them. It allows models to directly generate function calls that trigger external APIs, relying on the model’s inherent context understanding and structured output capabilities.

LLM Function Calling requires a JSON Schema formatted description to be written for every external function, and the prompt templates must be carefully designed to improve the accuracy of the function-call responses. If a requirement involves dozens of external systems, the design cost is significant and the cost of productization rises dramatically. MCP, by contrast, standardizes the operating specifications for both clients and servers and requires the communication between MCP Clients and MCP Servers to follow an agreed prompt template, which fosters collaboration among developers worldwide and lets global development outcomes be reused.
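For contrast, this is roughly what a single tool definition looks like in the OpenAI-style Function Calling format; every external function needs such a JSON Schema description, and the exact envelope differs from vendor to vendor:

```json
{
  "type": "function",
  "function": {
    "name": "get_current_time",
    "description": "Get the current time in a given timezone.",
    "parameters": {
      "type": "object",
      "properties": {
        "timezone": {
          "type": "string",
          "description": "IANA timezone name, e.g. Asia/Shanghai"
        }
      },
      "required": ["timezone"]
    }
  }
}
```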

The Essence and Challenges of MCP

Based on the previous explanations, we can summarize the essence of MCP as follows: the Model Context Protocol (MCP) is not a fixed data format or structure. Its essence lies in the system prompt that describes MCP Server/MCP Tool information and in the collaboration between the MCP Client and MCP Server, which together fundamentally address the issues of finding interfaces and parsing interfaces.

Once the essence of MCP is clarified, bringing it into enterprise-grade production applications reveals numerous challenges or deficiencies in these two core areas.

Challenges of Describing System Prompt Information for MCP

  • How to ensure the security of system prompts?

    • If the core system prompt is corrupted or tampered with, the LLM will no longer accurately know which MCP Servers and Tools exist, and may even be told about invalid or malicious MCP Servers and Tools. This poses substantial risks to your AI application and can break the entire MCP workflow.

  • How to manage system prompts?

    • If new versions of MCP Servers or Tools arise, the system prompts must have corresponding version management strategies.

  • If system prompts are poorly written, how can they be debugged easily and quickly? Can they take effect in real-time?

    • Since system prompts have no standard definition, theoretically, each enterprise can define its own system prompt templates similar to PE projects. Writing prompts perfectly cannot be accomplished at once; they require repeated adjustments and a mechanism to facilitate quick changes that can take effect in real-time.

  • If there are many MCP Servers, won’t the system prompts become excessively long and consume a huge number of tokens? How can the scope of MCP Servers and Tools be narrowed?

    • With dozens or more MCP Servers there may be hundreds or more MCP Tools. Writing all of their information into the system prompt produces a very large prompt template, which clearly consumes an excessive number of tokens and therefore money. A mechanism is needed to pre-select a limited set of MCP Servers and Tools based on the user’s query, reducing token usage and improving efficiency, much like intent recognition in online search; a rough sketch of such pre-selection follows below.
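As a rough, product-agnostic illustration of such pre-selection, one could score the tool descriptions against the user query (keyword overlap here; embeddings in practice) and only put the top matches into the system prompt:

```python
# Naive pre-selection sketch: keep only the MCP Tools whose descriptions overlap
# with the user's query, so the system prompt (and token bill) stays small.
from typing import Dict, List

TOOL_CATALOG: Dict[str, str] = {
    "time.get_current_time": "Returns the current time in a given timezone.",
    "crm.list_customers": "Lists customers owned by the current sales rep.",
    "billing.create_invoice": "Creates an invoice for a given order.",
}

def preselect_tools(query: str, catalog: Dict[str, str], top_k: int = 5) -> List[str]:
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(description.lower().split())), name)
        for name, description in catalog.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

print(preselect_tools("what time is it now in Beijing?", TOOL_CATALOG))
# -> ['time.get_current_time']  (only this tool's description enters the prompt)
```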

Challenges in the Collaboration Between MCP Client and MCP Server

  • The collaboration is the responsibility of the MCP Client, yet currently, there are very few MCP Clients, like Cline, Claude, Cursor, etc., all of which are C/S tools supporting the SSE protocol. How can enterprise-grade AI applications be integrated? Can they be?

    • Basically, the existing MCP Clients in the market cannot be integrated into enterprise-level AI applications; the stateful protocol SSE has many disadvantages, such as lacking recoverability, requiring the server to maintain long-term connections, only permitting server → client messages, and lacking flexibility for bidirectional communication.

  • Is it feasible to rapidly convert existing traditional businesses into MCP Servers? Can it be done with zero code changes?

    • Building an MCP Server is highly reliant on the MCP SDK for various programming languages, currently only supporting Python, Java, TS, Kotlin, and C#. What if a company uses Go or PHP? Furthermore, refactoring all existing businesses with the MCP SDK represents a monumental workload and is impractical.

  • With numerous MCP Servers, how can they be uniformly managed?

    • There are self-developed MCP Servers, third-party MCP Servers, and many MCP Servers converted from traditional businesses via some mysterious mechanism. A unified management system like an MCP Hub or MCP marketplace should be created to facilitate the use of MCP by Clients.

  • In enterprise-level AI applications, how can identity authentication, data permissions, and safety measures be implemented?

    • In enterprise-grade applications, regardless of protocol, architecture, or business type, issues surrounding identity authentication, data permissions, and security are always paramount. How can these be achieved in the collaborative manner characteristic of MCP?

A New Paradigm in AI Application Architecture

By integrating the MCP paradigm to address the aforementioned challenges, we have restructured the architecture of AI Agents. We implemented MCP enhancement capabilities in the **cloud-native API gateway and the microservice engine MSE Nacos**, addressing many of the challenges above, and added further enhancements in **Function Compute FC and Serverless Application Engine SAE**: the former enables rapid construction of MCP Servers, while the latter addresses the performance issues of self-hosted open-source Dify. Together they form a new paradigm for AI application development based on MCP.

Analysis of the New AI Application Architecture Paradigm

First, I will analyze the eight core calling links depicted in the diagram:

  • Step 1: The user initiates a request to the AI application, with request traffic entering the traffic gateway (cloud-native API gateway).

  • Step 2: The cloud-native API gateway maintains and manages API or routing rules for different types of AI Agents, forwarding the user request to the appropriate AI Agent.

  • Step 3: Regardless of how the AI Agent is implemented, any node requiring data will request available MCP Server and MCP Tool information from the MCP gateway (cloud-native API gateway).

  • Step 4: Because the MCP gateway may maintain information for many MCP Servers, the LLM can first be asked to narrow down the candidate MCP Servers and thereby reduce token consumption, so a request is sent through the AI gateway (cloud-native API gateway) to the LLM. (This step is optional.)

  • Step 5: The MCP gateway returns a list of defined MCP Server and Tool information to the AI Agent.

  • Step 6: The AI Agent sends the user request information and the MCP information obtained from the MCP gateway through the AI gateway to the LLM.

  • Step 7: After reasoning, the LLM returns one or more MCP Server and Tool information to resolve the issue.

  • Step 8: After obtaining the defined MCP Server and Tool information, the AI Agent requests the MCP Tool through the MCP gateway.

In actual production, steps 3–8 are executed repeatedly as the interaction proceeds.

We will analyze this new architecture based on the two essences of MCP.

How to Address Various Challenges of MCP Prompts

Our team is one of the most prolific contributors to open-source middleware, with projects such as Nacos, Higress, Sentinel, RocketMQ, and Seata, and also maintains open-source development frameworks like Spring Cloud Alibaba, Spring AI Alibaba, and Dubbo, giving us abundant experience in microservice architecture. For unified management of MCP Servers and prompts, our minds are therefore naturally drawn to the microservice model of Nacos-based service registration/discovery and unified configuration management, which can be adapted to the MCP paradigm. Consider the following correspondences:

  • SpringCloud services/Dubbo services/Go services → various MCP Servers

  • Exposed interfaces of SpringCloud services/Dubbo services/Go services → various MCP Tools provided by MCP Servers

  • Exposed interface descriptions of SpringCloud services/Dubbo services/Go services → descriptions of various MCP Tools

  • Configuration files of SpringCloud services/Dubbo services/Go services → system prompts for various MCP Servers

Therefore, within the MSE Nacos product, we have developed a series of enhancements to empower MCP, turning MSE Nacos into a unified management center for MCP Server, known as MCP Register (MCP Server registration/configuration center). This serves as a core component of the new paradigm for AI application development.

Moreover, the official MCP roadmap also outlines plans for MCP Register capabilities, and on the open-source side we will build this collaboratively, with Nacos serving as the MCP Register service.

MCP Register (MCP Server Registration/Configuration Center)

Unified Management of MCP Server

MCP Servers can be registered in MSE Nacos in two ways:

  • Manually create in the MSE Nacos console. This involves configuring the MCP Server’s endpoint in MSE Nacos.

  • Automatically registering the MCP Server in Nacos via the Nacos SDK, similar to the logic of current Java SpringCloud and Java Dubbo services.

Unified management of MCP Servers in MSE Nacos allows for health checks, load balancing, converting description information from JSON to XML, and controlling the online and offline status of MCP Servers.
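As a sketch of the second path, registering an MCP Server instance from code might look like the following; it uses the community nacos-sdk-python client (method names may differ between SDK versions), and the MSE Nacos address and namespace are placeholders:

```python
# Register a self-hosted MCP Server instance into (MSE) Nacos so the MCP gateway
# can discover it. Address/namespace below are placeholders, not real endpoints.
import nacos

client = nacos.NacosClient(
    server_addresses="mse-xxxxxx-nacos.example.com:8848",  # placeholder MSE Nacos address
    namespace="mcp-servers",                               # placeholder namespace
)

# The service name plays the role of the MCP Server name; ip/port point at the
# process that actually serves the MCP endpoint (SSE or Streamable HTTP).
client.add_naming_instance(
    service_name="time-mcp-server",
    ip="10.0.0.12",
    port=8080,
)
```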

Unified Management of MCP Prompt

In MSE Nacos, there are two methods to maintain the MCP Server Prompt:

  • Manually create the MCP Server configuration information, with the naming format for the Data ID in the configuration file as [MCP Server name]-mcp-tools.json.

    • Manage the prompts for MCP Tools in the configuration file, such as overarching descriptions and parameters.

  • Utilizing MSE service governance capabilities: for Java or Go services, the service schema can be sensed automatically, and the prompt information for MCP Servers and Tools is generated automatically.

Unified management of MCP Server prompts in MSE Nacos enables version management (and rollback) of MCP prompts, canary (gray) release of prompts, security management of prompts, and dynamic tuning of prompts that takes effect in real time.
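As an illustration of the first method, the content behind a Data ID such as time-mcp-server-mcp-tools.json might look roughly like the following; the field layout mirrors MCP's tools/list shape, and the exact schema expected by MSE Nacos may differ:

```json
{
  "tools": [
    {
      "name": "get_current_time",
      "description": "Returns the current time in the requested IANA timezone.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "timezone": { "type": "string", "description": "e.g. Asia/Shanghai" }
        },
        "required": ["timezone"]
      }
    }
  ]
}
```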

MCP Effect Verification System (in Progress)

As mentioned earlier, when there are many MCP Servers, their extensive descriptive information results in lengthy prompts and high token consumption, so mechanisms are needed to narrow the scope of MCP Servers based on user input, reducing token consumption and improving LLM reasoning efficiency. Beyond that, anything involving interaction with an LLM must prioritize prompt quality: the entire MCP process depends heavily on prompt engineering, and if prompts are poorly tuned the LLM may not return accurate MCP Server and Tool selections, rendering the whole process unusable. We are therefore building an MCP effect verification system around Nacos.

The core principle is that we will provide an AI Agent developed based on Spring AI Alibaba. Utilizing user-configured business inputs, LLMs, and a narrowed-down collective of MCP Servers and Tools, constant verification will occur, with outcomes displayed visually (such as success rates). Users can dynamically adjust prompts in Nacos for those MCP Servers yielding a low success rate to enhance their effectiveness.

Security Assurance for MCP (Ongoing Improvement)

Irrespective of the architecture or mode, security is paramount in enterprise production, and these considerations are even more diverse in MCP.

  • Management of sensitive information for MCP Servers: Various MCP Servers registered in MSE Nacos may contain sensitive data like API Keys, AK/SK, secrets, and login passwords. MSE Nacos integrates deeply with Alibaba Cloud KMS to provide encryption for these sensitive details.

  • Security management for MCP Prompts: Once again, leveraging the integration of MSE Nacos with KMS can encrypt the complete Prompt (description information) of MCP Server and Tools, preventing Prompt pollution.

  • Security validation of MCP Prompts: Along with the above verification system and content safety integration, MSE Nacos can conduct legitimacy checks for MCP Prompts.

Addressing Challenges of Collaboration Between MCP Client and MCP Server

Within the MCP paradigm, a trio of roles collaborate:

  • MCP Client → LLM

  • MCP Client → MCP Server

Both collaboration relationships are essentially provider/consumer relationships, involving both proxying and traffic control, which in traditional development paradigms is typically the job of a gateway. We have therefore enhanced the cloud-native API gateway with LLM-proxy and MCP Server-proxy capabilities, giving it the dual roles of traffic gateway and AI gateway (LLM proxy) as well as MCP gateway functionality. This forms a central component of the new paradigm for AI application development.

Therefore, within the overall system architecture of an enterprise, only a single cloud-native API gateway is required to function as a traffic gateway, API gateway, microservice gateway, AI gateway, and MCP gateway. This achieves a consolidated control of traditional and AI operations and seamlessly intertwines AI tasks with traditional business endeavors.

Cloud-Native API Gateway Dog Food

Adhering to the principle of “eating my own dog food,” the cloud-native API gateway has already seen extensive use internally within Alibaba Group across multiple enterprises, gaining endorsements for its capabilities in enterprise-level products, stability, and performance.

AI Gateway

The interaction between MCP Clients and LLMs mimics that between traditional services and LLMs. Once an application transitions to production, a series of issues arise that must be solved:

  • Cost balance: For instance, deploying the DeepSeek R1 671B model necessitates at least two 8-card H20 machines, with annual list prices exceeding 1 million, yet the two machines have limited TPS, insufficient to meet concurrent requests from multiple users. Even the newly released Llama4 requires at least one H100 to run. Hence, a solution must strike a balance between TPS and costs.

  • Model hallucination: Even the full DeepSeek R1 671B model, in the absence of online searching, has a significant hallucination issue.

  • Multi-model switching: Single-model services pose substantial risks and limitations, like stability risks and the inability to choose the optimal model based on business (consumer) needs. Currently, no open-source components or frameworks address such challenges.

  • Safety compliance: Enterprise clients require audits of the Q&A process to ensure compliance and mitigate usage risks.

  • High availability for model services: When a self-built model service reaches its performance ceiling, it should be able to fall back to an underlying large-model service so that the customer experience is preserved.

  • Closed-source model QPS/Token limitations: Commercial large models generally impose QPS/Token limits based on API Key dimensions, necessitating effective methods for rapidly generating quota increases.

Each of these issues represents actual challenges encountered by clients during usage; some are intrinsic to the model, while others stem from architectural deployment issues. Compelling customers to tackle them one by one results in high complexities and time costs. Therefore, the intervention of an AI gateway is essential for swiftly and uniformly addressing these core problems.

The AI gateway capabilities enhancement in the cloud-native API gateway comprises four main components:

  • Multi-model adaptation: It can proxy all mainstream model hosting services on the market and accommodate AI services compliant with the OpenAI protocol. This module comprises core functionalities, including protocol conversion, multi-API Key management, fallback, and multi-model switching.

  • AI security: Security safeguards are categorized into three levels: one involves content security for inputs and outputs, another ensures the stability of downstream LLM services, and the third manages consumer authentication for AI interface consumers. This module integrates functionality such as content review, token-based traffic limiting and degradation, and consumer authentication.

  • AI plugins: The AI gateway’s flexible extension mechanism is implemented through plugins. A number of plugins are pre-built, and users can also develop custom ones to enrich traffic control in AI scenarios. For instance, the AI plugin mechanism is how we implemented capabilities like result caching, prompt decoration, and vector retrieval.

  • AI observability: Observability in AI scenarios differs significantly from traditional scenarios in the metrics that matter. The cloud-native API gateway combines Alibaba Cloud’s logging service and observability products to provide an observability module aligned with the business semantics of AI applications, supporting indicators such as token consumption, streaming/non-streaming RT, first-packet RT, and cache hit rate. All input and output tokens are also recorded in the log service SLS for more detailed analysis.

For a more detailed plan regarding the AI gateway proxying LLMs, you can refer to my previous article: https://mp.weixin.qq.com/s/tZ0wsTlZK67r9IxNZ57TDQ

MCP Gateway

The interactions between MCP Clients and MCP Servers differ from traditional service providers and consumers; therefore, we have additionally embedded MCP capabilities within the cloud-native API gateway. However, from the perspective of product version differentiation, these capabilities remain part of the AI gateway's functionality.

Dynamic Discovery of MCP Servers

The previous discussion introduced MSE Nacos as the registration/configuration center for MCP Servers. How can MCP Clients discover them? If MCP Clients directly interact with MSE Nacos, the Nacos SDK would also need to be included within the MCP Client, increasing coding complexity.

Given that deep integration has already been established between the cloud-native API gateway and MSE Nacos in the traditional services domain, facilitating automatic discovery and registration of services within MSE Nacos, we have similarly achieved automatic discovery and registration for MCP Servers residing in MSE Nacos under the MCP paradigm.

Through this method, MCP Clients only need the access point of the cloud-native API gateway to dynamically and automatically obtain all MCP Servers registered in MSE Nacos. The cloud-native API gateway (MCP gateway) thus becomes an MCP Hub: whenever MCP Servers are updated or changed, only operations in MSE Nacos are needed, and MCP Clients require no modification.

Zero-Code Transformation of Traditional Services into MCP Servers

In the AI era, I believe the most valuable aspect is enhancing and elevating existing customer businesses into AI applications or AI-enhanced services rather than developing entirely new AI applications.

Thus, developing an AI application or enhancing an existing service requires interaction with numerous current business processes. Although MCP offers a unified protocol, restructuring existing businesses into MCP Servers entails high costs, and currently supported programming languages are limited, where languages like Go and PHP lack corresponding MCP SDKs. This leads many enterprises to hesitate in embracing MCP.

The gateway is best equipped to handle transformations of protocols. Many existing traditional services have already been registered in Nacos under classic microservice scenarios. Therefore, by utilizing this integration, conventional services registered in Nacos can be transformed into MCP Servers through zero-code changes.

  • The existing business services registered in MSE Nacos (SpringCloud services, Dubbo services, Go services) require no modifications.

  • A new configuration file adhering to the [Server Name]-mcp-tools.json naming format will be added in MSE Nacos, wherein the MCP specifications will describe the interfaces of existing business services.

  • Through the cloud-native API gateway (MCP gateway), MCP Clients can automatically discover the MCP Servers transformed from traditional services.

Conversion of SSE to Streamable HTTP

The default transmission protocol for MCP is SSE (Server-Sent Events), which is essentially a stateful long-connection protocol. This kind of protocol has many disadvantages in enterprise applications:

  • No support for resumability: After disconnecting, the client must restart the entire session.

  • The server is required to maintain long-term connections: it must remain highly available to support persistent SSE connections.

  • SSE only supports server → client messages, lacking flexibility for bidirectional communication.

  • Currently, only a few client/server architecture clients and MCP-provided web clients support the MCP paradigm and SSE protocol, making it unsuitable for enterprise production products.

Fortunately, the MCP core team has recognized these issues, and a new Streamable HTTP transport was released in late March. Streamable HTTP changes MCP’s data transmission method, making the protocol more flexible, easier to use, and more compatible:

  • More flexible: It supports streaming transmission without being mandatory.

  • More usable: It accommodates stateless servers.

  • More compatible: It is applicable to standard HTTP infrastructures.

In simple terms, the previous MCP transmission method is like being required to stay on the line for the entire conversation with customer service (SSE requires keeping a long connection open), whereas the new method lets you send a message at any time and wait for a reply (an ordinary HTTP request whose response can optionally be streamed).

Consider the following:

  • Streamable HTTP eliminates barriers present in the current several C-end MCP Clients. This implies that any requester (even a simple HTTP Request code) can interact with MCP Servers similarly to how they request standard HTTP APIs.

  • In other words, once interaction is possible through standard HTTP API requests, does the notion of a dedicated MCP Client still strictly hold?

Although Streamable HTTP is still in the draft phase, the cloud-native API gateway has implemented the ability to automatically convert the SSE transmission protocol into Streamable HTTP protocol. In other words, MCP Servers accessed through the cloud-native API gateway (MCP gateway) can support both SSE and Streamable HTTP as transmission protocols for Clients.
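To make the point concrete, calling an MCP Tool over Streamable HTTP is in essence just a JSON-RPC POST. The sketch below assumes a stateless server behind a placeholder gateway URL and omits the initialize handshake and Mcp-Session-Id header that a stateful server would additionally require:

```python
# Calling an MCP Tool over Streamable HTTP with a plain POST (illustrative only).
import requests

response = requests.post(
    "https://gateway.example.com/mcp/time",  # placeholder MCP gateway endpoint
    headers={
        "Content-Type": "application/json",
        # Streamable HTTP lets the server answer with plain JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
    },
    json={
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {"name": "get_current_time", "arguments": {"timezone": "Asia/Shanghai"}},
    },
    timeout=30,
)
print(response.status_code, response.text)
```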

Identity Authentication and Permission Management under the MCP Paradigm

Identity authentication and permission control are imperative in any architecture and business context, and the MCP paradigm is no exception. Here, two levels of permission management exist:

  • Which MCP Servers, and which MCP Tools within them, a Client is allowed to use.

  • Which data a Client is allowed to access through those MCP Tools.

Usage Permissions for MCP Server and MCP Tool

Imagine that once existing businesses can be transformed into MCP Servers with zero code changes, the MCP Servers and Tools registered in Nacos will span many domains: finance-related MCP Servers, sales-related MCP Servers, after-sales MCP Servers, and so on. The MCP gateway cannot simply return all MCP Server and Tool information to every Client; when relaying MCP Server and Tool details, it must return only those the Client is entitled to use.

As the MCP gateway, the cloud-native API gateway provides HTTP Basic Auth, OAuth 2.0, JWT, API Key, external authentication, and other authentication options through a mature plugin mechanism, enabling users to flexibly manage Client identity authentication and MCP Server/MCP Tool usage permissions.

Data Permissions for MCP Server and MCP Tool

When MCP Servers are data-type services, such as a MySQL MCP Server or Redis MCP Server, permissions can extend down to the database and table level. In such scenarios, the cloud-native API gateway, acting as the MCP gateway, can use its plugin mechanism to rewrite or add request header values and, together with MSE service governance, pass those headers through to the backend, where finer-grained data permission control is enforced within the services.

Let me offer an example of employing this approach for database read/write separation:
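The original diagram is not reproduced here, but the idea can be sketched as follows: a gateway plugin injects a header derived from the authenticated consumer (the header name and values below are hypothetical), and the service behind the MCP gateway uses it to route reads to a replica and writes to the primary:

```python
# Hypothetical sketch: the MCP gateway injects "X-DB-Role" per consumer, and the
# MCP Server picks a datasource accordingly (read/write separation).
READ_DSN = "mysql://replica.internal:3306/app"   # placeholder connection strings
WRITE_DSN = "mysql://primary.internal:3306/app"

def pick_datasource(headers: dict) -> str:
    # "X-DB-Role" is a made-up header set by a gateway auth plugin; consumers with
    # read-only permissions never reach the primary instance.
    return WRITE_DSN if headers.get("X-DB-Role") == "readwrite" else READ_DSN

print(pick_datasource({"X-DB-Role": "readonly"}))   # -> replica DSN
print(pick_datasource({"X-DB-Role": "readwrite"}))  # -> primary DSN
```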

How to Rapidly Construct MCP Servers

It is well known that AI applications involving LLM reasoning typically operate in relatively sparse call scenarios, and the MCP paradigm relies heavily on LLM inference. Consequently, whether an AI application is built on HTTP APIs or on the MCP paradigm, it mostly runs in a sparse-call context. This raises two pressing questions:

  • In sparse call scenarios, how can the computational resources running MCP Servers be used efficiently; put bluntly, how can costs be kept down?

  • How can MCP Servers be constructed quickly in new businesses?

Among all computing products, Function Compute (FC), a Serverless FaaS product, performs best in sparse-call contexts in terms of resource granularity, elasticity strategy, and elasticity efficiency.

Function Compute (FC) currently supports MCP operating environments for Python and NodeJS (operating environments for other programming languages will soon follow). After selecting the MCP operating environment and creating a function, users need only compose the business logic for MCP Tools without worrying about employing the MCP SDK. Furthermore, the deep integration of cloud-native API gateways and Function Compute (FC) facilitates the natural alignment with the new paradigm for AI application development.

Elastic Efficiency of MCP Servers

MCP Servers built on Function Compute (FC) can be viewed through two dimensions concerning elastic efficiency:

  • Granular control over resource specifications.

  • Full elasticity based on user requests.

Function Compute (FC) instances range from 0.05C 128MB to 16C 32GB, with dozens of combinations to choose from according to the business carried by each MCP Server. Moreover, in AI applications, especially process-orchestrated ones, most AI Agents carry single, computationally simple responsibilities and can be served by small resource specifications. Smaller specifications in turn bring advantages in resource scheduling and elasticity efficiency.

Moreover, Function Compute (FC) scales purely on request load: instances scale out as QPS rises and can be reused, and when QPS falls, idle instances are automatically released without any user intervention. On top of the default request-based elasticity, users can configure scheduled scaling or metric-threshold strategies to handle more diverse and volatile business patterns and reach the best balance between resources and cost.

Observability of MCP Servers

Function Compute (FC) features a comprehensive observability system, meaning MCP Servers constructed on Function Compute (FC) also possess observability capabilities across the dimensions of metrics, traces, and logs.

This observability system allows users to clearly ascertain the operational status of each MCP Server.

How to Resolve Pain Points of Open-source Self-built Dify

Currently, Dify has become the most widely used tool for visual process orchestration in AI Agents. However, no public cloud vendors offer Dify hosting products. Hence, many clients utilizing self-built open-source Dify platforms encounter a plethora of common issues, especially when transitioning from individual developers or demo development to enterprise-grade production application construction. These issues can often be fatal.

Common problems faced by enterprises based on self-built open-source Dify:

  • Weak traffic protection: Without any protective measures, self-built solutions are susceptible to breaches.

  • Tight coupling of the control plane and data plane: AI application design (orchestration) and Agent execution run in the same deployment, which poses stability risks in high-concurrency scenarios.

  • Load balancing issues: Under heavy traffic conditions, core Dify services may experience reduced stability due to uneven traffic loading.

  • Lack of observability: Open-source Dify does not come with observability capabilities, requiring additional setup for such systems.

To address these issues, the Serverless PaaS type computing product on Alibaba Cloud, Serverless Application Engine (SAE), has devised an enterprise-grade Dify hosting deployment solution aimed at resolving the above issues, allowing businesses to utilize Dify without concerns over stability, robustness, or performance.

Quick Deployment of Dify

SAE provides Dify application templates enabling one-click instantiation of Dify applications, along with visual building capabilities to adjust each aspect within Dify individually.

Ensuring Stable, High Availability for Dify

SAE’s Dify deployment supports multi-instance, multi-AZ deployment and automatic instance migration, and works with the cloud-native API gateway and built-in service governance to ensure load balancing and stability. It also performs health checks on the six core Dify services and supports lossless online/offline control.

Additionally, leveraging its underlying Serverless architecture, applications deployed in SAE exhibit outstanding horizontal scaling efficiencies while permitting multiple forms of elasticity rule configurations, ensuring that the entire Dify service scales based on varying business scenarios, maintaining high availability while also enhancing cost advantages.

Moreover, SAE includes functionalities like low traffic warming and CPU bursts to further assure stable Dify applications in extreme conditions.

Dify Scheduling Solutions

Using scheduled workflows for AI data processing is a common business scenario, and Dify’s official documentation provides a best practice for executing scheduled tasks via Dify workflows and monitoring their status: https://docs.dify.ai/zh-hans/learn-more/use-cases/dify-schedule. However, that practice is a rudimentary Dify Schedule: task scheduling simply runs Dify workflows through GitHub Actions, which is far from an enterprise-grade solution.

Open-source Dify’s scheduling pain points predominantly include three factors:

  • Excessive execution records may lead to slow queries.

    • Execution history is stored in databases; if oversaturated, performance degrades leading to slow queries.

  • Execution record queries do not support conditional filtering.

    • For instance, querying by time intervals or by task statuses is a common requirement that the open-source Dify does not accommodate.

  • No monitoring or alerts.

    • Task scheduling systems necessitate monitoring the execution of workflows; should workflows fail, alerts should be sent to respective personnel. The open-source solution lacks any alerting capabilities.

Our solution is to utilize MSE Task Scheduling (SchedulerX) to address these concerns.

  • Users configure the Dify endpoint in MSE Task Scheduling, pulling the workflow applications through the Dify API.

  • Users set up scheduled execution and alert monitoring through MSE Task Scheduling.

  • When scheduling Dify workflows, MSE Task Scheduling utilizes the API provided by Dify to schedule the user’s Dify applications and retrieves execution results and details in real-time, storing them in MSE’s AI task scheduler.

  • Employing AI task scheduling allows for alert monitoring and enhanced observability.
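To give a feel for the scheduling side, one scheduled run ultimately boils down to a call against Dify's workflow-run API, roughly as sketched below; the base URL and API key are placeholders, the endpoint and payload follow Dify's public API as I understand it, and in the real solution MSE SchedulerX wraps this call with retries, alerting, and record storage:

```python
# What one scheduled run amounts to: trigger a Dify workflow via its API and
# inspect the outcome. Placeholder base URL / API key; error handling trimmed.
import requests

DIFY_BASE = "https://dify.example.com/v1"  # placeholder Dify API base URL
API_KEY = "app-xxxxxxxx"                   # placeholder workflow API key

def run_workflow_once(inputs: dict) -> dict:
    response = requests.post(
        f"{DIFY_BASE}/workflows/run",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json={"inputs": inputs, "response_mode": "blocking", "user": "scheduler"},
        timeout=600,
    )
    response.raise_for_status()
    return response.json()  # contains the run id, status, outputs, token usage, ...

if __name__ == "__main__":
    result = run_workflow_once({"date": "2025-04-09"})
    print(result.get("data", {}).get("status"))
```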

The MSE Task Scheduling integrated Dify solution has the following seven advantages over the open-source solutions:

| Functionality | MSE Task Scheduling + Dify | Open-source Dify |
| --- | --- | --- |
| Scheduled execution | Yes | No |
| Monitoring & alerts | Yes | No |
| Retention of execution records | Retained for the past two months | No limit, but excessive data can severely degrade performance |
| Querying execution records | Supports time ranges, statuses, and other query conditions | Limited filtering conditions |
| Permission management | Fine-grained, operation-level permission management | User-level permissions |
| Traffic limiting | Application-level limiting, token limiting | None |
| Automatic retry on failure | Yes | No |

AI Application Observability System

By integrating Alibaba Cloud’s observability product ARMS with OpenTelemetry-based trace tracking, we construct a comprehensive observability system for AI applications.

The overall AI application observability system construction revolves around two core components:

  • Data collection.

  • Data linkage and analysis.

Observational Data Collection

The crux of data collection is broad coverage, which bifurcates into two facets:

  • The support for a wide range of programming languages and development frameworks.

  • The cloud products involved in the new AI application architecture should report data adhering to the same standards.

Through these facets, we achieve comprehensive coverage via Alibaba Cloud’s application monitoring product ARMS combined with trace tracking OpenTelemetry:

  • Adhering to the latest OpenTelemetry community GenAI semantic conventions.

  • Support for popular AI frameworks and models, including Spring AI Alibaba / LLamaIndex / Langchain / Tongyi Qianwen 2 / OpenAI / PromptFlow, etc.

  • Support for mainstream programming languages in AI application development, such as Python, Java, and Go, while providing more finely-tuned tracking and attributes than community standards.

  • Facilitate the dissemination of session information across different call chains.

  • The cloud-native API gateway supports the OpenTelemetry protocol, and both the gateway itself and its plugins report observational data based on OpenTelemetry.

  • Function Compute FC and Serverless Application Engine SAE have deeply integrated with both application monitoring ARMS and OpenTelemetry trace tracking products.
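To give a feel for what adhering to the GenAI semantic conventions means in practice, the sketch below records a single model call as an OpenTelemetry span with a few gen_ai.* attributes. The attribute names follow the community conventions, which are still evolving, and in the real setup this instrumentation is injected automatically by the ARMS agent rather than written by hand:

```python
# Manual sketch of a GenAI-convention span; in practice the ARMS / OpenTelemetry
# agent instruments supported frameworks automatically.
from opentelemetry import trace

tracer = trace.get_tracer("ai-agent")

def record_model_call(model: str, input_tokens: int, output_tokens: int) -> None:
    with tracer.start_as_current_span("chat " + model) as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)

record_model_call("qwen-max", input_tokens=812, output_tokens=256)
```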

Data Linkage and Analysis

Within the application monitoring ARMS, a monitoring module specifically designed for LLM applications has been developed, providing a complete observability system tailored for AI scenarios.

Vertical metrics include:

  • Number of online AI applications.

  • Number of traces.

  • Number of spans.

  • Number of large models.

  • Token usage statistics.

  • Number of sessions.

  • Number of users.

  • Number of model invocations.

  • Token consumption metrics.

  • Model invocation durations.

  • Rank of token consumption.

  • Etc…

In terms of horizontal linkage, specialized capabilities for call chain analysis are offered:

  • Span lists.

  • Trace lists.

  • Scatter plots.

  • Full linkage aggregation.

  • Overall call chain topology.

  • Error/slow trace analysis.

  • Each link along the call chain exhibits input, output, and token consumption analytics.

Additional Planned Functions on the Horizon

Transforming Dify DSL to Spring AI Alibaba Code

Although Dify offers substantial convenience for developing AI Agents, limitations concerning its programming language (Python) and workflow engine's implementation logic can impair performance in complex AI applications. Consequently, we are investigating automatically converting Dify workflow DSL into code based on the Spring AI Alibaba development framework.

This ensures that Dify’s low-code visual constructs attract users, while the underlying kernel hinges on code from the Spring AI Alibaba development framework, enabling both the convenient orchestration of AI Agents and enhanced runtime performance.

LLM-Driven Orchestration of MCP Servers

Under the current MCP paradigm, the LLM returns a definitive MCP Server and Tool in response to the user’s prompt, governed mainly by the system prompt. In theory, the LLM could return multiple MCP Servers and Tools for a single query, together with their invocation order, based on the descriptions of those MCP Servers and Tools; the LLM would then effectively take on the responsibility of orchestrating MCP Servers. This approach is still being explored and closely resembles current Multi-Agent models.

Enhancing Performance in MCP Mode

Given that the MCP mode frequently interacts with LLMs, it is evident that it lags behind traditional API calling in performance. Currently, in scenarios sensitive to latency, MCP mode may not be suitable.

We are also examining and exploring enhancements for the request performance issues inherent in the MCP model. For instance:

  • Solidifying MCP Server/MCP Tool combinations to reduce interactions with the LLM. Once LLM-driven orchestration of MCP Servers is realized, interaction with the LLM may only be needed during development or debugging, while production calls rely on the already-established MCP Server and Tool invocation relationships.

  • Exploring edge deployment with Function Compute, running MCP Servers closer to users.

Impact of the New Paradigm for AI Application Architecture on Enterprises

At this point, the introduction of the enterprise-grade AI application architecture paradigm comes to a close. There are many segments within the overall structure, with numerous details that cannot be fully elaborated in this article. Interested individuals are encouraged to contact me for further discussion.

Let’s consider the potential changes in organizational structure and in the collaborative relationships among O&M, product, R&D, and business operations teams under this new AI application architecture paradigm. What transformations might occur in how applications and systems are developed?

Here, I wish to share some of my thoughts.

MCP Server First

API First and front-end/back-end separation have been around for quite a while, and overseas enterprises generally adhere to them and practice them well. Having worked deeply in the Serverless computing domain for five years, I have studied AWS Lambda, Azure Functions, Azure App Service, GCP Cloud Functions, and GCP Cloud Run architectures extensively and worked with many customers on Serverless FaaS and PaaS architectures, including migrating several clients from Dual A to Alibaba Cloud. Most of those scenarios follow the standard API Gateway + FaaS model and exhibit an API First shape. Domestically, however, this model has not been practiced well. Aside from Amap’s major decision to move its entire system architecture onto Function Compute, achieving true API First and front-end/back-end separation, very few other customers have practiced this model, perhaps because of long-standing historical burdens.

The above image shows architectural comparisons for Amap before and after.

In the era of AI applications, the essence is still the invocation of various APIs, but transforming existing HTTP APIs into standardized REST APIs carries huge costs. With the emergence of MCP, however, our plan to help customers move to the new AI application architecture paradigm with zero code changes makes MCP Server First a real possibility.

  • The O&M team: Responsible for the maintenance of cloud products (e.g., the cloud-native API gateway, MSE Nacos, Serverless Application Engine, PAI, etc.), maintenance of the observability system (also based on cloud products), and ongoing communication with cloud vendors.

  • The R&D team: Understands the atomic capabilities of the company’s business and is responsible for constructing the MCP Server pool.

  • The business operations/marketing/product teams: Build business processes (business orchestration) using low-code visual approaches, articulating business requirements in plain language to quickly assemble business processes or stand up AI applications.

Therefore, it’s quite likely that each enterprise may possess its own MCP Server market, categorized accordingly, with each class of MCP Server managed by specialized teams. Developers need not be overly concerned with standardized return formats or uniform programming languages. When operations, marketing, product, or other business segments have business requirements or new product functionality requests, they can swiftly build AI applications through a unified interface employing layman’s terms, leveraging MCP and LLM for business orchestration, thereby realizing a new development model termed PRD as Product.

Contact

Follow and engage with us through the following channels to stay updated on the latest developments from higress.ai.
