Organized for you: 8 common application scenarios of AI gateways
Ji yuan | Mar 7, 2025
Previously, we shared "The Top 10 Essential Capabilities an AI Gateway Must Have" from the provider's perspective. Today, we introduce eight common application scenarios of AI gateways from the consumer's perspective. Since most enterprises currently deploy large models mainly for internal use, the scenarios in this article stem primarily from internal needs; when models are offered as external services, requirements become finer-grained and more extensive, and lean on AI gateways even more heavily.
01 Multi-Model Services
No single model dominates, so enterprises often adopt multi-model strategies: employees choose a large model on the front end, while the back end switches freely between different large model services. For instance, an enterprise can deploy DeepSeek, Qwen, self-built models, and others side by side, letting employees pick whichever produces the best result for the task. The more diversified an enterprise's business, the stronger its demand for multi-model services.
Demand scenarios:
Multi-modal business integration scenarios, where enterprises need to process text, images, audio, and 3D data simultaneously. Research and product teams often need models with strong reasoning capabilities; customer service, marketing, and graphic design teams rely heavily on image models; industrial design and film production teams need audio and video models.
Enterprises covering multiple vertical fields, which need to invoke dedicated models suited to each industry's characteristics. This is especially true for supply-chain enterprises, which often serve several industries at once and may need multiple industry-specific large models.
Complex task collaboration scenarios, where a single task needs several models working together, for example chaining multiple large models to achieve the best results in content generation.
Scenarios with dual demands for safety and efficiency, such as a medical institution that analyzes patient data with a dedicated private model while routing unrelated requests to a general model, keeping sensitive and non-sensitive data from mixing when written to the database.
Implementation plan: The AI Gateway provided by Alibaba Cloud's cloud-native API gateway supports switching between backend models based on the model name, so a single interface can connect to many large model services. These services can be deployed on different platforms, such as BaiLian, PAI, or IDC self-built models; even when the models are owned by different development and operations teams, no cross-team coordination is required.
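To make the switching concrete, here is a minimal client-side sketch, assuming an OpenAI-compatible gateway endpoint: one URL fronts several backends, and routing is driven purely by the model name. The endpoint, environment variable, and model names are illustrative placeholders, not fixed product values.

```python
import os
from openai import OpenAI

# One client, one endpoint: the gateway picks the backend based solely
# on the "model" field of each request.
client = OpenAI(
    base_url="https://ai-gateway.example.com/v1",  # hypothetical gateway address
    api_key=os.environ["GATEWAY_API_KEY"],         # hypothetical key variable
)

for model in ("deepseek-v3", "qwen-max", "self-built-model"):
    resp = client.chat.completions.create(
        model=model,  # mapped by the gateway to DeepSeek, Qwen, or an IDC-hosted model
        messages=[{"role": "user", "content": "Summarize our Q1 report."}],
    )
    print(model, "->", resp.choices[0].message.content[:80])
```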
02 Consumer Authentication
Demand scenarios:
Multi-tenant model rental scenarios: when an enterprise provides a shared large model service to different departments or teams, it differentiates tenants through API Keys to ensure data isolation and permission management. Specific requirements include:
Assigning an independent API Key to each tenant to control calling permissions and resource quotas; for example, department A is allotted 20 calls per person per day, while department B is allotted 30 calls per day.
Supporting tenant-defined model parameters (such as temperature and output length), with the gateway validating that the tenant has permission to set them.
Internal permission level management: different roles within the enterprise need differentiated access to model capabilities. Specific requirements include:
Using RBAC (Role-Based Access Control) to restrict sensitive features (like model fine-tuning and data exporting).
For cost reasons, making multi-modal large models available only to the design department.
Logging operation records linked to user identities to meet internal auditing requirements; for instance, a financial enterprise restricts its risk assessment model to the risk control department to prevent misuse by ordinary employees.
Implementation plan:
The AI Gateway provided by Alibaba Cloud's cloud-native API gateway supports consumer authentication on route configurations to control API access and enforce security policies. Through the API Key lifecycle of generation, distribution, authorization, and verification, it ensures that only authorized requests can access services (a client-side sketch follows the list below), providing:
Trusted identity: Ensuring the requester is a registered/authorized user or system.
Risk interception: Preventing malicious attacks, illegal calls, and resource abuse.
Compliance assurance: Meeting data security regulations and enterprise auditing requirements.
Cost control: Achieving precise billing and API quota management through authentication.
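As a rough client-side illustration, the sketch below gives each department its own API Key against a single gateway endpoint; the key is how the gateway identifies the consumer, enforces quotas, and attributes usage. The endpoint, key names, and quotas are hypothetical.

```python
import os
from openai import OpenAI

def make_client(tenant_env_var: str) -> OpenAI:
    # Each tenant (department) is issued its own API Key; the gateway uses
    # the key to identify the consumer and apply that tenant's policies.
    return OpenAI(
        base_url="https://ai-gateway.example.com/v1",  # hypothetical endpoint
        api_key=os.environ[tenant_env_var],
    )

dept_a = make_client("DEPT_A_API_KEY")  # e.g. quota of 20 calls/person/day
dept_b = make_client("DEPT_B_API_KEY")  # e.g. quota of 30 calls/day

resp = dept_a.chat.completions.create(
    model="qwen-max",
    messages=[{"role": "user", "content": "Draft a meeting summary."}],
)
print(resp.choices[0].message.content)
# A request signed with an unknown or revoked key is rejected by the
# gateway (typically 401/403) before it reaches any model backend.
```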
03 Automatic Model Switching
Demand scenarios:
Exceptions caused by the model's own characteristics: large models generate results with probabilistic fluctuations, so outputs can be unstable, and new version releases can cause traffic loss.
Exceptions caused by improper usage: request parameters that do not conform to the API specification cause connection timeouts or interruptions, and maliciously constructed prompts trigger the model's safety mechanisms, returning empty results or error codes.
Resource and performance limitations: excessive request frequency triggers rate limiting and makes the service unavailable, while long requests occupy too much memory, blocking subsequent requests until they time out.
Dependency service failures: external dependencies, such as the databases accessed during RAG retrieval, become unavailable, leaving the model without the context it needs.
Implementation plan: The AI Gateway provided by Alibaba Cloud's cloud-native API gateway supports falling back to other designated large model services after a request to a given service fails, ensuring robustness and continuity.
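The gateway performs this fallback transparently, but conceptually it behaves like the client-side sketch below (the endpoint, key, and model names are placeholders):

```python
from openai import OpenAI, APIError

client = OpenAI(base_url="https://ai-gateway.example.com/v1", api_key="sk-demo")

def chat_with_fallback(messages, primary="deepseek-v3", fallback="qwen-max"):
    # Try the primary model first; on failure, retry against the backup.
    for model in (primary, fallback):
        try:
            resp = client.chat.completions.create(
                model=model, messages=messages, timeout=30,
            )
            return model, resp.choices[0].message.content
        except APIError:  # covers timeouts, connection errors, and HTTP errors
            continue      # fall through to the next model in the list
    raise RuntimeError("all model backends failed")

used, answer = chat_with_fallback([{"role": "user", "content": "Hello"}])
print(f"answered by {used}: {answer[:80]}")
```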
04 Token-Level Rate Limiting
Demand scenarios:
Although internal usage rarely generates peak simultaneous concurrency, rate limiting allows hardware resources to be provisioned more economically. For example, a company with 10,000 employees does not need hardware sized for 10,000 concurrent users; it can provision for 7,000 and rate-limit the excess, avoiding idle resources. Other requirements include:
Improving resource management: large models consume compute unpredictably; rate limiting prevents system overload and ensures stable performance for all users, especially during peak periods.
Tiered user policies: rate limits can be applied per ConsumerId or per API Key.
Preventing malicious usage: limiting token counts curbs junk requests and attacks, avoiding wasted resources.
Implementation plan:
The AI Gateway provided by Alibaba Cloud's cloud-native API gateway includes the ai-token-ratelimit plugin, which rate-limits tokens based on key values derived from URL parameters, HTTP request headers, client IP addresses, consumer names, or cookie keys.
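For intuition, here is a minimal token-bucket sketch of token-level (rather than request-level) rate limiting keyed by consumer name; it mimics in plain Python the kind of budget the ai-token-ratelimit plugin enforces at the gateway, with arbitrary numbers.

```python
import time
from collections import defaultdict

class TokenRateLimiter:
    """Token bucket per consumer: the budget is measured in model tokens."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.rate = tokens_per_minute / 60.0  # refill rate, tokens per second
        self.buckets = defaultdict(
            lambda: {"tokens": float(self.capacity), "ts": time.time()}
        )

    def allow(self, consumer: str, requested_tokens: int) -> bool:
        b = self.buckets[consumer]
        now = time.time()
        # Refill proportionally to elapsed time, capped at capacity.
        b["tokens"] = min(self.capacity, b["tokens"] + (now - b["ts"]) * self.rate)
        b["ts"] = now
        if b["tokens"] >= requested_tokens:
            b["tokens"] -= requested_tokens
            return True
        return False  # over budget: the gateway would answer 429 here

limiter = TokenRateLimiter(tokens_per_minute=10_000)
print(limiter.allow("dept-a", 800))     # True while dept-a has budget left
print(limiter.allow("dept-a", 20_000))  # False: exceeds the per-minute budget
```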
05 Content Security and Compliance
Demand scenarios:
Enterprises are serious work environments; self-built large models must ensure the safety and compliance of generated content, including filtering harmful or inappropriate content, detecting and blocking requests that contain sensitive data, and auditing the quality and compliance of AI-generated content.
Sensitive data processing in the financial industry: auditing users' financial transaction instructions and investment consultation content to prevent fraud, money laundering, and other violations, and running compliance checks on model-generated financial reports and risk assessment results.
Medical and health information interaction: auditing generated electronic medical record content to prevent leakage of patient privacy (such as ID numbers and diagnosis records) and ensuring AI-generated medical advice complies with relevant regulations; multi-modal large models identify sensitive information in medical images and, combined with a compliance rule database, intercept it automatically.
Social media and UGC content management: real-time auditing of user-published images, text, and video, intercepting pornographic, violent, or false information, plus compliance checks on AI-generated recommendation content (such as short-video titles and comments).
Government service platform interaction: auditing public inquiries submitted to government services to prevent malicious attacks or the spread of sensitive information, and ensuring AI-generated policy interpretations and service guides comply with relevant regulations.
E-commerce and live-streaming platform risk control: auditing product descriptions and live-stream comments, intercepting false advertising and information about banned items, and ensuring compliance of AI-generated marketing copy.
Implementation plan:
The AI Gateway provided by Alibaba Cloud's cloud-native API gateway integrates Alibaba Cloud's Content Security service, auditing both the prompts sent to large language models and the text they generate (a sketch of the check flow follows the list below). This covers:
Preventing attacks: Validating inputs can stop malicious prompt injections, preventing the model from generating harmful content.
Maintaining model integrity: Avoiding input manipulations that could lead to erroneous or biased outputs.
User safety: Ensuring the output contains no harmful or misleading content, protecting users from adverse effects.
Content moderation: Filtering out inappropriate content, such as hate speech or vulgar language, especially in public applications.
Legal compliance: Ensuring outputs meet legal and ethical standards, especially in the medical or financial fields.
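The sketch below shows the general pre- and post-check flow in application code; moderate() is a hypothetical stand-in for the real content security service, which the gateway actually invokes inline on every request and response.

```python
from openai import OpenAI

client = OpenAI(base_url="https://ai-gateway.example.com/v1", api_key="sk-demo")

def moderate(text: str) -> bool:
    """Hypothetical stand-in for a real moderation call (e.g. a content
    security API). Returns True when the text is considered safe."""
    banned = ("credit card number", "patient id")
    return not any(term in text.lower() for term in banned)

def safe_chat(prompt: str) -> str:
    if not moderate(prompt):                 # audit the input first
        return "Request blocked: input failed the compliance check."
    resp = client.chat.completions.create(
        model="qwen-max",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    if not moderate(answer):                 # then audit the output
        return "Response withheld: output failed the compliance check."
    return answer
```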
06 Semantic Caching
Demand scenarios:
Large model API services price input tokens differently on cache hits and misses: X yuan per million input tokens on a hit versus Y yuan on a miss, with X much lower than Y; for the Tongyi (Qwen) series, for example, X is only 40% of Y. Caching large model responses in an in-memory database via a gateway plugin therefore improves both inference latency and cost. The gateway layer can also automatically cache a user's dialogue history and fill it in as context for subsequent turns, giving the model continuity across the conversation. For example:
High-frequency repetitive query scenarios: In customer service systems, intelligent assistants, etc., users often raise repeated questions (like "How to reset my password" or "Refund process"), caching common questions and their answers avoids repeated model calls, lowering call costs.
Fixed-context multiple-call scenarios: analyzing legal documents (such as interpreting contract terms) or educational materials (such as knowledge-point Q&A), where the same long text is queried repeatedly. Caching the context avoids redundant data transmission and processing, improving response times and reducing call costs.
Complex computation results reuse scenarios: In data analysis and generation scenarios (like financial report summaries, research report generation), caching the results of multiple analyses on the same dataset avoids redundant calculations.
RAG (Retrieval-Augmented Generation) scenarios: Caching knowledge base retrieval results (like internal FAQs), accelerating responses to subsequent similar queries.
Implementation plan:
The AI Gateway provided by Alibaba Cloud's cloud-native API gateway offers extension points for caching request and response content in Redis, with configurable Redis connection settings and cache expiry times.
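A minimal sketch of the caching idea, assuming a local Redis instance and a hypothetical gateway endpoint. It keys on an exact hash of model plus prompt; a true semantic cache would instead embed the prompt and match on similarity above a threshold.

```python
import hashlib
import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379, db=0)
client = OpenAI(base_url="https://ai-gateway.example.com/v1", api_key="sk-demo")

def cached_chat(prompt: str, model: str = "qwen-max", ttl: int = 3600) -> str:
    key = "llm:" + hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if (hit := r.get(key)) is not None:
        return hit.decode()  # served from Redis; no model call, no token cost
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    r.setex(key, ttl, answer)  # cache with the configured expiry time
    return answer

print(cached_chat("How do I reset my password?"))  # first call hits the model
print(cached_chat("How do I reset my password?"))  # second call hits the cache
```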
07 Online Search + Accessing Full Text of Retrieved Web Pages
Demand scenarios:
Online search has become a standard feature of large models. If a model does not support online search, or can retrieve only webpage titles, summaries, and keywords rather than full text, generation quality drops significantly.
Implementation plan:
The AI Gateway provided by Alibaba Cloud's cloud-native API gateway implements online search with full-text retrieval of the returned web pages through the following enhancements (a pipeline sketch follows the list):
LLM Query Rewriting: using an LLM to identify user intent and generate search queries significantly improves search quality.
Keyword Refinement: different engines need different prompts; for example, arXiv papers are mostly in English, so keywords should be generated in English.
Domain Recognition: arXiv, for example, is organized into subfields under computer science, physics, mathematics, biology, and so on; searching within the right domain improves accuracy.
Long Query Splitting: long queries can be split into multiple short queries to improve search efficiency.
High-Quality Data: Google/Bing/arXiv search returns only article abstracts, whereas connecting Alibaba Cloud's information retrieval service IQS to Quark search enables full-text retrieval, improving the quality of LLM-generated content.
(Figure: typical application scenario results.)
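Putting these enhancements together, here is a rough pipeline sketch: rewrite the query with an LLM, retrieve full text, then generate from the retrieved sources. The search_full_text() helper is a hypothetical placeholder for an IQS/Quark-style backend, and the endpoint is illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="https://ai-gateway.example.com/v1", api_key="sk-demo")

def rewrite_query(question: str) -> str:
    """LLM query rewriting: distill the user's intent into a search query."""
    resp = client.chat.completions.create(
        model="qwen-max",
        messages=[{"role": "user",
                   "content": f"Rewrite as a concise web search query: {question}"}],
    )
    return resp.choices[0].message.content.strip()

def search_full_text(query: str) -> list[str]:
    """Hypothetical search + full-text retrieval backend; a real system
    would call a service such as IQS here and return full page texts."""
    return [f"[full text of a page matching '{query}']"]

def answer_with_search(question: str) -> str:
    docs = search_full_text(rewrite_query(question))
    context = "\n\n".join(docs)
    resp = client.chat.completions.create(
        model="qwen-max",
        messages=[{"role": "user",
                   "content": f"Answer using these sources:\n{context}\n\nQ: {question}"}],
    )
    return resp.choices[0].message.content
```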
08 Large Model Observability
Demand scenarios:
Observability is most often needed for cost control and stability. Because the resource consumption of large model applications is far more sensitive and volatile than that of web applications, the cost-control demand for observability is even stronger: without comprehensive observability, abnormal calls can cause resource losses of tens or even hundreds of thousands of yuan.
In addition to traditional observability indicators such as QPS, RT, and error rates, large model observability should also include:
Token consumption statistics based on consumers.
Token consumption statistics based on models.
Rate limiting indicators: how many requests are intercepted by rate limiting per unit time, and which consumers are being rate-limited.
Cache hit rates.
Safety statistics: breakdowns by risk type and by the consumers that trigger risks.
Implementation plan:
Alibaba Cloud API Gateway supports viewing gateway monitoring data, enabling log delivery and distributed tracing in the cloud-native API gateway, and viewing monitoring data for REST APIs and interfaces. These features help you manage and optimize interface performance more efficiently while improving overall service quality. Additionally, SLS can aggregate ActionTrail events, cloud product observability logs, LLM gateway detail logs, dialogue detail logs, Prompt Trace, and real-time inference call logs into a complete, unified observability solution.
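As a simple application-level complement to the gateway's built-in metrics, the sketch below reads the usage field of an OpenAI-compatible response and logs per-consumer, per-model token counts; the endpoint and names are placeholders.

```python
import logging
from openai import OpenAI

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm-usage")

client = OpenAI(base_url="https://ai-gateway.example.com/v1", api_key="sk-demo")

def tracked_chat(consumer: str, model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    u = resp.usage  # OpenAI-compatible responses report token counts here
    # The gateway aggregates the same dimensions (consumer, model, tokens)
    # into metrics and logs; this sketch emits one structured line per call.
    log.info("consumer=%s model=%s prompt_tokens=%d completion_tokens=%d",
             consumer, model, u.prompt_tokens, u.completion_tokens)
    return resp.choices[0].message.content

print(tracked_chat("dept-a", "qwen-max", "Summarize today's standup notes."))
```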