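APISIX × LLM Traffic Gateway

APISIX can do more than sit in front of LLM APIs: it is becoming one of the mainstream choices for an LLM gateway, managing authentication, rate limiting, routing, and observability for AI traffic.

Why do LLMs need a dedicated gateway?

What a traditional API gateway handles:
  request in → forward → response out
  Latency: milliseconds; response body: kilobytes

What makes LLM APIs different:
  ├── Long response times (5 to 60 seconds)
  ├── Streaming output (SSE)
  ├── Token-based billing (by usage, not per call)
  ├── Multi-model routing (GPT-4 / Claude / Qwen / local models)
  ├── Very high cost (needs fine-grained quota management)
  └── Prompt auditing and content safety

An ordinary gateway is not enough; you need one that "understands AI".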

Overall architecture

┌───────────────────────────────────────────────────────────────────────┐
│                             Client layer                              │
│           Web app    Mobile    Agents    Internal services            │
└───────────────────────────────────┬───────────────────────────────────┘
                                    ↓
┌───────────────────────────────────────────────────────────────────────┐
│                          APISIX LLM Gateway                           │
│                                                                       │
│  ┌────────────────┐  ┌──────────────────┐  ┌───────────────────────┐  │
│  │ Authentication │  │ Traffic control  │  │ Model routing         │  │
│  │ JWT / API key  │  │ Token rate limit │  │ GPT/Claude/Qwen/local │  │
│  └────────────────┘  └──────────────────┘  └───────────────────────┘  │
│                                                                       │
│  ┌────────────────┐  ┌──────────────────┐  ┌───────────────────────┐  │
│  │ Token metering │  │ Streaming proxy  │  │ Prompt audit          │  │
│  │ Usage stats    │  │ SSE pass-through │  │ Content safety filter │  │
│  └────────────────┘  └──────────────────┘  └───────────────────────┘  │
│                                                                       │
│  ┌────────────────┐  ┌──────────────────┐  ┌───────────────────────┐  │
│  │ Semantic cache │  │ Failover         │  │ Observability         │  │
│  │ Redis vectors  │  │ Fallback models  │  │ Prometheus/Grafana    │  │
│  └────────────────┘  └──────────────────┘  └───────────────────────┘  │
└───────────────────────────────────┬───────────────────────────────────┘
                                    ↓
                ┌───────────────────┼───────────────────┐
                ↓                   ↓                   ↓
         ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
         │ OpenAI      │     │ Tongyi Qwen │     │ Local Ollama│
         │ GPT-4o      │     │ Qwen-Max    │     │ DeepSeek-R1 │
         └─────────────┘     └─────────────┘     └─────────────┘

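Implementing the core capabilities

1. 🔑 Multi-tenant authentication + quota management

# Create a Consumer for each business unit and give each its own quota.
# $ADMIN_KEY is your Admin API key. Note that limit-count counts requests, not tokens.
curl -X PUT http://localhost:9180/apisix/admin/consumers \
  -H "X-API-KEY: $ADMIN_KEY" \
  -d '{
    "username": "business_unit_A",
    "plugins": {
      "key-auth": {
        "key": "biz-a-api-key-xxxx"
      },
      "limit-count": {
        "count": 100000,
        "time_window": 86400,
        "key": "consumer_name",
        "rejected_code": 429,
        "rejected_msg": "Daily request quota exhausted"
      }
    }
  }'

Different customers → different API keys → different quota policies

Because limit-count counts requests, the config above caps business_unit_A at 100,000 requests per day. Token quotas such as the ones below have to be enforced against the per-consumer token counter described in section 3:

Business A (paid tier): 1,000,000 tokens/day
Business B (basic tier): 100,000 tokens/day
Internal testing: unlimited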
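
To make the tiered token quotas concrete, here is a minimal standalone Python sketch of the check a gateway could run before forwarding a request. It is an illustration, not APISIX code: the tier names, the `DAILY_TOKEN_QUOTA` table, and the `QuotaTracker` class are all hypothetical, and the in-memory dict stands in for the Redis counter keyed by consumer and day.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical daily token quotas per tier; None means unlimited.
DAILY_TOKEN_QUOTA: Dict[str, Optional[int]] = {
    "paid": 1_000_000,
    "basic": 100_000,
    "internal": None,
}


@dataclass
class QuotaTracker:
    """In-memory stand-in for a Redis counter keyed by consumer and day."""
    used: Dict[str, int] = field(default_factory=dict)

    def remaining(self, consumer: str, tier: str, day: str) -> Optional[int]:
        """Tokens left today, or None when the tier is unlimited."""
        quota = DAILY_TOKEN_QUOTA[tier]
        if quota is None:
            return None
        return max(quota - self.used.get(f"{consumer}:{day}", 0), 0)

    def allow(self, consumer: str, tier: str, day: str) -> bool:
        """True if the consumer may send another request today."""
        left = self.remaining(consumer, tier, day)
        return left is None or left > 0

    def record(self, consumer: str, day: str, total_tokens: int) -> None:
        """Add the tokens reported by the model to today's counter."""
        key = f"{consumer}:{day}"
        self.used[key] = self.used.get(key, 0) + total_tokens
```

The check runs before the request is forwarded and the counter is updated after the response, so a single request can overshoot the quota by its own token count. That trade-off is usually acceptable because the exact usage is only known once the model has answered.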

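2. 🔀 Smart multi-model routing

# Base route: authenticate the caller, then proxy to OpenAI with the gateway's own key.
# $ENV_OPENAI_KEY is a placeholder for however the upstream key is injected.
# "pass_host": "node" makes the gateway send Host: api.openai.com to the upstream.
curl -X PUT http://localhost:9180/apisix/admin/routes/llm-router \
  -H "X-API-KEY: $ADMIN_KEY" \
  -d '{
    "uri": "/v1/chat/completions",
    "plugins": {
      "key-auth": {},
      "proxy-rewrite": {
        "headers": {
          "set": {
            "Authorization": "Bearer $ENV_OPENAI_KEY"
          }
        }
      }
    },
    "upstream": {
      "type": "roundrobin",
      "nodes": {"api.openai.com:443": 1},
      "scheme": "https",
      "pass_host": "node"
    }
  }'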

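-- Custom Lua plugin: route dynamically by model and user tier.
-- Sketch only: get_request_model and get_user_tier are helpers you would
-- write yourself, and ctx.upstream_host / ctx.upstream_uri stand in for
-- whatever mechanism finally selects the upstream node.
local core = require("apisix.core")

local _M = {
    name = "llm-router",
    priority = 1500,
}

function _M.access(conf, ctx)
    local model = get_request_model(ctx) or ""  -- "model" field of the request body
    local user_tier = get_user_tier(ctx)        -- user tier stored on the Consumer

    -- Routing policy
    if model == "gpt-4" and user_tier == "premium" then
        ctx.upstream_host = "api.openai.com"
        ctx.upstream_uri  = "/v1/chat/completions"

    elseif model == "gpt-4" and user_tier == "basic" then
        -- Downgrade: gpt-4 requests from basic users go to gpt-3.5.
        -- Change only the model field so the messages are preserved.
        local body = core.json.decode(core.request.get_body() or "") or {}
        body.model = "gpt-3.5-turbo"
        ngx.req.set_body_data(core.json.encode(body))
        ctx.upstream_host = "api.openai.com"
        ctx.upstream_uri  = "/v1/chat/completions"

    elseif model:find("qwen", 1, true) then
        ctx.upstream_host = "dashscope.aliyuncs.com"
        ctx.upstream_uri  = "/compatible-mode/v1/chat/completions"

    elseif conf.prefer_local then
        -- Prefer the local model (saves cost)
        ctx.upstream_host = "ollama.internal:11434"
    end
end

return _M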
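
The routing policy is easier to reason about, and to unit-test, when it is written as a pure function from (model, tier) to a target. The following standalone Python sketch mirrors the rules of the Lua plugin above. The `Target` type and `choose_target` function are hypothetical names, and the `/v1/chat/completions` path for the local Ollama host is an assumption based on Ollama's OpenAI-compatible endpoint.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    host: str
    path: str
    model: str  # model name to send upstream (may differ from the request)


def choose_target(model: Optional[str], user_tier: str,
                  prefer_local: bool = False) -> Optional[Target]:
    """Return where a request should go, or None to keep the route default."""
    model = model or ""  # a missing model field must not crash the router
    if model == "gpt-4" and user_tier == "premium":
        return Target("api.openai.com", "/v1/chat/completions", "gpt-4")
    if model == "gpt-4" and user_tier == "basic":
        # Downgrade: basic users asking for gpt-4 are served by gpt-3.5-turbo.
        return Target("api.openai.com", "/v1/chat/completions", "gpt-3.5-turbo")
    if "qwen" in model:
        return Target("dashscope.aliyuncs.com",
                      "/compatible-mode/v1/chat/completions", model)
    if prefer_local:
        # Assumed path: Ollama's OpenAI-compatible chat endpoint.
        return Target("ollama.internal:11434", "/v1/chat/completions", model)
    return None
```

Keeping the decision separate from the side effects (rewriting the body, picking the upstream) means the rule table can be tested without a running gateway.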

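3. 💰 Token usage metering (the core of usage-based billing)

-- token-metering plugin
-- Reads the usage field of the response and records token consumption.
-- Redis and Kafka need cosockets, which are unavailable in body_filter, so the
-- body is parsed there and the writes run in a timer started from the log phase.
-- connect_redis, kafka_producer, calc_cost and get_today are assumed helpers.

function _M.body_filter(conf, ctx)
    -- Keep a copy of every chunk; the full body is returned on the last one
    local body = core.response.hold_body_chunk(ctx, true)
    if not body then return end
    local resp = core.json.decode(body)
    if type(resp) ~= "table" or type(resp.usage) ~= "table" then return end

    local usage = resp.usage
    local total_tokens = usage.total_tokens or 0
    ctx.llm_usage = {
        consumer     = ctx.consumer_name or "anonymous",
        model        = resp.model,
        prompt_t     = usage.prompt_tokens     or 0,
        completion_t = usage.completion_tokens or 0,
        total_t      = total_tokens,
        cost_usd     = calc_cost(resp.model, total_tokens),
        timestamp    = ngx.time(),
        request_id   = ctx.var.request_id,
    }
end

local function flush_usage(premature, rec)
    if premature then return end

    -- Accumulate usage in Redis (feeds real-time quota checks)
    local redis_client = connect_redis()
    local redis_key = "token_usage:" .. rec.consumer .. ":" .. get_today()
    redis_client:incrby(redis_key, rec.total_t)
    redis_client:expire(redis_key, 86400)

    -- Push to Kafka for billing analysis; send takes (topic, key, message)
    kafka_producer:send("llm-usage-events", nil, core.json.encode(rec))
end

function _M.log(conf, ctx)
    if ctx.llm_usage then
        ngx.timer.at(0, flush_usage, ctx.llm_usage)
    end
end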

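Token metering data flow:

Request completes
   ↓
Read response.usage.total_tokens
   ↓
Accumulate in Redis (real-time quota decisions)
   ↓
Publish to Kafka asynchronously (billing / analytics / alerting)
   ↓
ClickHouse (data warehouse, monthly bills)
   ↓
Grafana dashboards (cost trends)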
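
The metering plugin calls a `calc_cost` helper that the article leaves undefined. Below is one possible shape for it as a standalone Python sketch. The model names and prices in `PRICES` are made up for illustration, and unlike the call in the plugin this version takes prompt and completion tokens separately, on the assumption that the two are priced differently. `Decimal` is used so that billing sums do not pick up floating-point error.

```python
from decimal import Decimal
from typing import Dict, Tuple

# Hypothetical prices in USD per 1,000 tokens: (prompt, completion).
PRICES: Dict[str, Tuple[Decimal, Decimal]] = {
    "model-a": (Decimal("0.005"), Decimal("0.015")),
    "model-local": (Decimal("0"), Decimal("0")),  # self-hosted: no per-token fee
}


def calc_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost in USD of one request, priced separately for input and output."""
    prompt_price, completion_price = PRICES[model]
    return (prompt_price * prompt_tokens
            + completion_price * completion_tokens) / 1000
```

For example, 1,000 prompt tokens and 2,000 completion tokens on `model-a` cost 0.005 + 0.030 = 0.035 USD under these made-up prices.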

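4. 🌊 Streaming response pass-through (SSE)

-- LLM streaming output uses the SSE (Server-Sent Events) format.
-- Data format:
-- data: {"id":"xxx","choices":[{"delta":{"content":"你"}}]}
-- data: {"id":"xxx","choices":[{"delta":{"content":"好"}}]}
-- data: [DONE]

-- What the gateway has to handle:

function _M.header_filter(conf, ctx)
    -- Adjust response headers here, after the upstream has answered.
    -- (ngx.arg is only valid in body_filter, so buffering cannot be
    -- switched off from the access phase.)
    local content_type = ngx.header["Content-Type"]
    if type(content_type) == "string"
       and content_type:find("text/event-stream", 1, true) then
        ctx.is_sse = true
        ngx.header["Cache-Control"] = "no-cache"
        -- Asks any nginx in front of the gateway not to buffer. Buffering inside
        -- the gateway is controlled by proxy_buffering in its nginx configuration.
        ngx.header["X-Accel-Buffering"] = "no"
    end
end

function _M.body_filter(conf, ctx)
    if not ctx.is_sse then return end
    -- Streaming metering: count tokens while the stream passes through.
    -- accumulate_stream_tokens and finalize_token_count are assumed helpers.
    local chunk = ngx.arg[1]
    if chunk and chunk ~= "" then
        -- Parse each SSE chunk and add up the token count
        accumulate_stream_tokens(ctx, chunk)
    end

    -- ngx.arg[2] is true on the last chunk of the stream
    if ngx.arg[2] then
        finalize_token_count(ctx)
    end
end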
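
The subtle part of `accumulate_stream_tokens` is that network chunks do not line up with SSE events: one chunk can end in the middle of a `data:` line, or even in the middle of a multi-byte UTF-8 character. The standalone Python sketch below shows one way to handle that by buffering raw bytes and only decoding complete lines. `SSEAccumulator` is a hypothetical name, and reading token counts from a `usage` object in a late chunk assumes the provider sends one (OpenAI-compatible APIs do so only when asked to include usage in the stream).

```python
import json
from typing import List, Optional


class SSEAccumulator:
    """Reassembles SSE events from arbitrarily split byte chunks."""

    def __init__(self) -> None:
        self._buf = b""
        self.text_parts: List[str] = []
        self.usage: Optional[dict] = None
        self.done = False

    def feed(self, chunk: bytes) -> None:
        self._buf += chunk
        # Only complete lines are decoded, so a chunk boundary inside a
        # multi-byte character or a JSON payload is harmless.
        while b"\n" in self._buf:
            raw, self._buf = self._buf.split(b"\n", 1)
            line = raw.decode("utf-8").rstrip("\r")
            if not line.startswith("data:"):
                continue  # blank separators, comments, other SSE fields
            payload = line[5:].strip()
            if payload == "[DONE]":
                self.done = True
                continue
            try:
                event = json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip malformed events instead of failing the stream
            if not isinstance(event, dict):
                continue
            for choice in event.get("choices") or []:
                piece = (choice.get("delta") or {}).get("content")
                if piece:
                    self.text_parts.append(piece)
            if event.get("usage"):
                self.usage = event["usage"]
```

If the provider does not report usage in the stream, the collected `text_parts` can be run through a tokenizer at the end to estimate the completion tokens instead.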

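5. 🧠 Semantic cache (a powerful cost saver)

-- semantic-cache plugin
-- Similar questions are answered from the cache without spending tokens.
-- get_request_body, extract_last_user_message, embedding_service,
-- vector_search and vector_store are assumed helpers.

function _M.access(conf, ctx)
    local body = get_request_body(ctx)
    local messages = body.messages
    local query = extract_last_user_message(messages)

    -- 1. Embed the query (calls the embedding service)
    local query_vector = embedding_service:encode(query)

    -- 2. Approximate nearest-neighbour search in Redis Vector / ES
    --    (Lua has no keyword arguments, so options go in a table)
    local cached = vector_search(query_vector, {threshold = 0.95})

    if cached then
        -- Cache hit: answer directly without forwarding to the LLM
        ngx.header["X-Cache"] = "HIT"
        ngx.header["X-Cache-Similarity"] = cached.score

        return 200, {
            id      = "cached-" .. ngx.now(),
            object  = "chat.completion",
            choices = cached.choices,
            usage   = {total_tokens = 0},  -- cache hit, nothing to bill
            _cached = true
        }
    end

    -- Miss: the request continues to the LLM
    ctx.query_vector = query_vector  -- picked up later to populate the cache
    ngx.header["X-Cache"] = "MISS"
end

function _M.body_filter(conf, ctx)
    -- Only cache successful responses, and parse once the last chunk is in
    if not ctx.query_vector or ngx.status ~= 200 then return end
    local raw = core.response.hold_body_chunk(ctx, true)
    if not raw then return end

    local resp = core.json.decode(raw)
    if type(resp) ~= "table" then return end
    ctx.cache_entry = {
        vector  = ctx.query_vector,
        choices = resp.choices,
        model   = resp.model,
        ttl     = conf.cache_ttl or 3600
    }
end

local function save_entry(premature, entry)
    if premature then return end
    vector_store:save(entry)
end

function _M.log(conf, ctx)
    -- Writing to the store needs cosockets, so it runs in a timer
    if ctx.cache_entry then
        ngx.timer.at(0, save_entry, ctx.cache_entry)
    end
end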

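What the semantic cache does:

"How do I read a file in Python?"   → cache MISS → call GPT → store in cache
"How can I open a file in Python?"  → similarity 0.97 → cache HIT → return directly
"Read a file with Python"           → similarity 0.96 → cache HIT → return directly

Saved: 2 of 3 LLM calls, roughly two thirds of the token cost in this example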
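
The hit/miss decision comes down to comparing embedding vectors against a similarity threshold. The standalone Python sketch below shows that core with a brute-force scan over an in-memory list. `SemanticCache` and `cosine_similarity` are hypothetical names, and a real deployment would replace the scan with an approximate nearest-neighbour index in a vector store, as the plugin above assumes.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def lookup(self, vector: Sequence[float]) -> Optional[Tuple[str, float]]:
        """Return (answer, score) for the best match at or above the threshold."""
        best: Optional[Tuple[str, float]] = None
        for stored, answer in self.entries:
            score = cosine_similarity(vector, stored)
            if score >= self.threshold and (best is None or score > best[1]):
                best = (answer, score)
        return best

    def save(self, vector: Sequence[float], answer: str) -> None:
        self.entries.append((list(vector), answer))
```

The threshold is the main tuning knob: set too low, users get answers to questions they did not ask; set too high, the cache rarely hits.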

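6. 🛡️ Prompt injection protection / content safety

-- prompt-guard plugin
-- get_request_body, extract_user_messages, content_safety_api, log_violation
-- and inject_system_constraint are assumed helpers.

local INJECTION_PATTERNS = {
    "忽略之前的指令",               -- "ignore the previous instructions"
    "ignore previous instructions",
    "你现在是.+没有限制",           -- "you are now ... without restrictions"
    "DAN mode",
    "越狱",                         -- "jailbreak"
    "jailbreak",
}

local SENSITIVE_PATTERNS = {
    -- Guard against leaking the system prompt
    "system prompt",
    "你的指令是什么",               -- "what are your instructions"
    "重复你收到的所有内容",         -- "repeat everything you received"
}

local function matches_any(text, patterns)
    for _, pattern in ipairs(patterns) do
        -- "i": ignore case, "u": UTF-8 mode, "jo": JIT-compile and cache
        if ngx.re.find(text, pattern, "ijou") then
            return true
        end
    end
    return false
end

function _M.access(conf, ctx)
    local body = get_request_body(ctx)
    local user_input = extract_user_messages(body.messages)

    -- Detect prompt injection and attempts to extract the system prompt
    if matches_any(user_input, INJECTION_PATTERNS)
       or matches_any(user_input, SENSITIVE_PATTERNS) then
        return 400, {
            error = "detected_prompt_injection",
            message = "Request contains unsafe content"
        }
    end

    -- Call a content safety API (Alibaba Cloud Green / Tencent Cloud Tianyu)
    if conf.content_check then
        local is_safe, reason = content_safety_api(user_input)
        if not is_safe then
            -- Record the violation
            log_violation(ctx, user_input, reason)
            return 400, {error = "content_policy_violation"}
        end
    end

    -- Inject system-level constraints (force-append to the system prompt)
    if conf.system_prompt_append then
        inject_system_constraint(body, conf.system_prompt_append)
        -- Write the modified body back so the upstream actually receives it
        ngx.req.set_body_data(core.json.encode(body))
    end
end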
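
Pattern lists like the one above are easy to get subtly wrong (case sensitivity, multi-byte text, which message roles are scanned), so it helps to test them outside the gateway. The standalone Python sketch below reuses the article's injection patterns. The `find_injection` function and the choice to scan only string-valued user messages are assumptions made for illustration.

```python
import re
from typing import List, Optional

INJECTION_PATTERNS = [
    r"忽略之前的指令",
    r"ignore previous instructions",
    r"你现在是.+没有限制",
    r"DAN mode",
    r"越狱",
    r"jailbreak",
]
# Compile once; IGNORECASE so "IGNORE Previous Instructions" is also caught.
_COMPILED = [re.compile(p, re.IGNORECASE) for p in INJECTION_PATTERNS]


def find_injection(messages: List[dict]) -> Optional[str]:
    """Return the first matching pattern in the user messages, else None."""
    text = "\n".join(
        m["content"]
        for m in messages
        if m.get("role") == "user" and isinstance(m.get("content"), str)
    )
    for pattern in _COMPILED:
        if pattern.search(text):
            return pattern.pattern
    return None
```

Keyword matching only catches the crudest attacks and is simple to evade with paraphrasing, which is why the plugin also calls a content safety API as a second layer.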

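7. 🔄 Failover + fallback strategy

# Primary upstream with active health checks and retries.
# Switching to a different provider is handled by the Lua logic further down,
# because other providers need their own host, path, and credentials.
curl -X PUT http://localhost:9180/apisix/admin/upstreams/llm-failover \
  -H "X-API-KEY: $ADMIN_KEY" \
  -d '{
    "name": "llm-with-fallback",
    "type": "roundrobin",
    "scheme": "https",
    "nodes": {
      "api.openai.com:443": 10
    },
    "checks": {
      "active": {
        "type": "https",
        "https_verify_certificate": false,
        "http_path": "/v1/models",
        "interval": 10,
        "timeout": 5,
        "healthy":   {"successes": 2},
        "unhealthy": {"http_failures": 3, "http_statuses": [429, 500, 503]}
      }
    },
    "retries": 2,
    "retry_timeout": 30,
    "pass_host": "node"
  }'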

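-- Custom fallback logic
-- Sketch only: check_upstream_health and get_openai_queue_depth are assumed
-- helpers, and ctx.upstream / ctx.model_override stand in for the actual
-- upstream switch.
function _M.access(conf, ctx)
    -- OpenAI timing out or rate limiting → switch to the backup model
    local primary_available = check_upstream_health("openai")

    if not primary_available then
        -- Fall back to Qwen
        ctx.upstream = "qwen-upstream"
        ctx.model_override = "qwen-max"

        -- Tell the client that a fallback happened
        ngx.header["X-Model-Fallback"] = "true"
        ngx.header["X-Fallback-Reason"] = "primary_unavailable"

    elseif get_openai_queue_depth() > (conf.max_queue or math.huge) then
        -- OpenAI queue too long → move part of the traffic to local Ollama
        if math.random() < 0.3 then  -- 30% of traffic
            ctx.upstream = "ollama-local"
            ctx.model_override = "deepseek-r1:7b"
        end
    end
end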
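
Because the fallback logic uses a random draw, it is awkward to test as written. One common approach, sketched below in standalone Python, is to pass the random source in as a parameter so that tests can fix it. `pick_upstream` is a hypothetical name, and the `"primary_overloaded"` reason string is an assumption that does not appear in the original logic.

```python
import random
from typing import Callable, Optional, Tuple

Decision = Tuple[str, Optional[str], Optional[str]]  # (upstream, model override, reason)


def pick_upstream(primary_healthy: bool, queue_depth: int, max_queue: int,
                  rand: Callable[[], float] = random.random,
                  spill_ratio: float = 0.3) -> Decision:
    """Decide which upstream serves the request.

    rand is injectable so the probabilistic branch can be tested deterministically.
    """
    if not primary_healthy:
        return ("qwen-upstream", "qwen-max", "primary_unavailable")
    if queue_depth > max_queue and rand() < spill_ratio:
        # Shed roughly spill_ratio of the traffic to the local model.
        return ("ollama-local", "deepseek-r1:7b", "primary_overloaded")
    return ("openai", None, None)
```

Random shedding spreads the overflow evenly across callers. If a given user should consistently land on the same model during an incident, hashing the consumer name into the 0 to 1 range would be a drop-in replacement for `rand`.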

Full production architecture

External callers (apps / agents / third parties)
            ↓ HTTPS
┌───────────────────────────────────────────────────────────┐
│                APISIX LLM Gateway cluster                 │
│                                                           │
│  ① key-auth          → Authentication                     │
│  ② rate-limit        → Abuse protection (RPM/TPM limits)  │
│  ③ prompt-guard      → Prompt injection detection         │
│  ④ semantic-cache    → Semantic cache (return on hit)     │
│  ⑤ llm-router        → Smart model routing                │
│  ⑥ proxy (streaming) → Forward to the LLM                 │
│  ⑦ token-metering    → Token usage accounting             │
│  ⑧ http-logger       → Audit log                          │
└───────────────┬───────────────────────────────────────────┘
                │
    ┌───────────┼──────────────┬────────────────┐
    ↓           ↓              ↓                ↓
┌────────┐ ┌────────┐  ┌────────────┐  ┌─────────────┐
│OpenAI  │ │Claude  │  │ Qwen/ERNIE │  │Local Ollama │
│GPT-4o  │ │Sonnet  │  │ Qwen-Max   │  │DeepSeek-R1  │
└────────┘ └────────┘  └────────────┘  └─────────────┘

            ↓ Data flow
┌────────────────┐  ┌──────────────┐  ┌────────────┐  ┌───────────────┐
│ Redis          │  │ Kafka        │  │ Prometheus │  │ ClickHouse    │
│ Semantic cache │  │ Usage events │  │ Metrics    │  │ Billing data  │
└────────────────┘  └──────────────┘  └────────────┘  └───────────────┘
                                                              ↓
                                                     Grafana dashboards

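Grafana dashboard metrics

Core LLM gateway metrics:

┌──────────────────────────────────────────────────────┐
│  Real-time metrics                                   │
│  ├── QPS (requests per second)                       │
│  ├── Average response time (TTFB / full response)    │
│  ├── Time to first token when streaming (TTFT)       │
│  ├── Error rate (4xx / 5xx / timeouts)               │
│  └── Cache hit rate                                  │
│                                                      │
│  Token usage                                         │
│  ├── Real-time TPM (tokens per minute)               │
│  ├── Usage by model (GPT-4 vs Claude vs ...)         │
│  ├── Usage by tenant                                 │
│  └── Cost trend (daily / weekly / monthly)           │
│                                                      │
│  Model health                                        │
│  ├── Availability per model                          │
│  ├── 429 rate-limit frequency (fallbacks triggered)  │
│  └── Failover trigger count                          │
└──────────────────────────────────────────────────────┘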
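
Of these metrics, real-time TPM is the one most specific to LLM traffic, and it is a sliding-window sum rather than a plain counter. The standalone Python sketch below shows the calculation. `TokensPerMinute` is a hypothetical name, and in a Prometheus setup the same figure would more likely come from a rate query over a token counter than from code like this.

```python
from collections import deque
from typing import Deque, Tuple


class TokensPerMinute:
    """Sliding-window sum of tokens seen during the last `window` seconds."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window = window_seconds
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.total = 0

    def add(self, timestamp: float, tokens: int) -> None:
        """Record tokens consumed by a request that finished at `timestamp`."""
        self.events.append((timestamp, tokens))
        self.total += tokens
        self._evict(timestamp)

    def value(self, now: float) -> int:
        """Tokens consumed in the window ending at `now`."""
        self._evict(now)
        return self.total

    def _evict(self, now: float) -> None:
        # Drop events that are a full window old or older.
        while self.events and self.events[0][0] <= now - self.window:
            _, tokens = self.events.popleft()
            self.total -= tokens
```

Keeping a running total makes each update cheap: every event is added once and removed once, regardless of how often the value is read.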

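Comparison with existing LLM gateway products

Criterion	Self-hosted APISIX	LiteLLM	PortKey	One API
Performance	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐
Customization flexibility	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Semantic cache	Build it yourself	✅ Built in	✅ Built in	❌
Token metering	Build it yourself	✅ Built in	✅ Built in	✅
Plugin ecosystem	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐	⭐⭐
Support for Chinese model providers	Configure it yourself	✅	⭐⭐	⭐⭐⭐⭐⭐
Operational complexity	High	Low	Low (SaaS)	Low
Best fit	Enterprise customization	Quick onboarding	Cloud-hosted	Multiple Chinese models

💡 Recommended strategy: use One API or LiteLLM for quick validation, self-hosted APISIX for large-scale production, and APISIX plus custom plugins when you need deep customization.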

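Summary

Capability	Implementation	Value
Multi-model routing	Custom Lua plugin	Flexible scheduling, no single-vendor dependency
Token metering	Parse usage in body_filter	Accurate billing, controllable cost
Semantic cache	Embeddings + vector search	Saves an estimated 30-70% of token cost, depending on how often queries repeat
Streaming pass-through	Disable buffering + SSE handling	Smooth user experience
Prompt protection	Regular expressions + content safety API	Security and compliance
Failover	Health checks + fallback strategy	High availability
Observability	Prometheus + Kafka	End-to-end monitoring

Core value: as an LLM gateway, APISIX turns AI capabilities into an enterprise-grade service that is manageable, billable, observable, and extensible, making it the infrastructure foundation for an internal AI platform.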
