Scenario Description
The gateway optimizes performance through:
Precision Caching: Reduces token consumption and latency by caching responses to repeated or highly similar queries (see the sketch after this list).
Semantic Context Caching: Stores LLM response contexts in memory and automatically injects a user's historical dialogue into subsequent prompts to strengthen contextual understanding.
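The precision-caching flow can be sketched roughly as follows. This is a minimal, illustrative example rather than the gateway plugin's actual API: the in-process dictionary stands in for the gateway's in-memory database, and handle_request and call_llm are hypothetical names.

```python
import hashlib

# Hypothetical in-process store standing in for the gateway's in-memory
# database; maps a normalized-query hash to the cached LLM response.
_response_cache: dict[str, str] = {}

def _cache_key(prompt: str) -> str:
    """Normalize the prompt and hash it so repeated queries map to one key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def handle_request(prompt: str, call_llm) -> str:
    """Serve repeated queries from the cache; otherwise call the LLM once
    and store the answer, saving both latency and token spend."""
    key = _cache_key(prompt)
    if key in _response_cache:      # cache hit: no upstream LLM call
        return _response_cache[key]
    response = call_llm(prompt)     # cache miss: pay the token cost once
    _response_cache[key] = response
    return response
```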
Practice Description
Through gateway plugins, the AI gateway reduces inference latency and cost by caching LLM responses in an in-memory database. Each user's dialogue history is stored automatically at the gateway layer and injected into the context of subsequent conversations, improving the model's grasp of the conversational semantics.
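A minimal sketch of this context-caching behavior, assuming an in-process store in place of the gateway's in-memory database; handle_chat, call_llm, and the message format are hypothetical names used only to illustrate injecting stored history into subsequent prompts.

```python
from collections import defaultdict, deque

# Hypothetical per-user history store standing in for the gateway's
# in-memory database; keeps the most recent messages for each user_id.
MAX_HISTORY = 20  # user + assistant messages retained per user
_dialogue_history: dict[str, deque] = defaultdict(lambda: deque(maxlen=MAX_HISTORY))

def handle_chat(user_id: str, user_message: str, call_llm) -> str:
    """Prepend the user's cached dialogue history to the new message,
    call the model, then store the new turn for the next request."""
    history = list(_dialogue_history[user_id])
    messages = history + [{"role": "user", "content": user_message}]
    reply = call_llm(messages)  # upstream LLM call with enriched context
    _dialogue_history[user_id].append({"role": "user", "content": user_message})
    _dialogue_history[user_id].append({"role": "assistant", "content": reply})
    return reply
```

Because the history lives at the gateway layer, clients can send single-turn requests while the model still receives multi-turn context.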