MaaS API

Google Generative AI API
- https://ai.google.dev/api/rest
- Google AI Studio, Gemini API
- BaseURL https://generativelanguage.googleapis.com/v1beta
- OpenAI BaseURL https://generativelanguage.googleapis.com/v1beta/openai - OpenAI API compatible
- 接口
  - /models/{model_id}:streamGenerateContent
  - /models/{model_id}:generateContent?key={key}
Google Vertex AI
- GCP, 企业级、全托管的机器学习 (ML) 平台
- Model Garden + MLOps
Anthropic
- https://platform.claude.com/docs/en/api/overview
- SDK
  - https://github.com/anthropics/anthropic-sdk-typescript
  - https://github.com/anthropics/anthropic-sdk-typescript/blob/main/src/resources/beta/messages/messages.ts
OpenAI
- https://platform.openai.com/docs/api-reference/chat/create
XAI
- https://docs.x.ai/docs/api-reference
https://www.postman.com/postman/anthropic-apis/documentation/dhus72s/claude-api
vLLM
- https://docs.vllm.ai/en/v0.10.2/api/vllm/entrypoints/openai/serving_completion.html
Provider
- https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-partner-models
错误码
参考
- https://www.openresponses.org/

openai	anthropic	google
parallel_tool_calls	disable_parallel_tool_use
max_completion_tokens	max_tokens

“长尾分布”
“突发性”
"Fat Tail" (肥尾)
3+Sigma + 15-30min 窗口检查异动

Gemini API

https://ai.google.dev/api/rest

Multiple tools are supported only when they are all search tools

内置 tool 和 functionDeclaration 工具不能同时使用
openai 里的 tool 映射为一个 functionDeclaration
其他的 tool 是内置 tool，语义上有点区别

OpenAI API

streaming

第一个chunk 有 role 没内容，之后的 chunk 有内容没有 role
最后的 chunk，usage 和 finish_reason 分开
stream_options
- continuous_usage_stat
  - 连续发送 usage
- include_usage
  - 最后一个 chunk 包含 usage
第一个 chunk 和最后一个 chunk 不应该包含 content
有些供应商在第二个 chunk 返回 role
参考
- https://github.com/BerriAI/litellm/blob/4a8629ce/tests/local_testing/test_streaming.py

first chunk

有些为了紧凑，会在第一个 chunk 包含内容
正常情况第一个 chunk 不应该包含内容

last chunk

vLLM, OpenAI 最后一个 chunk 的 content 为空

{
  "index": 0,
  "delta": {
    "content": ""
  },
  "logprobs": null,
  "finish_reason": "stop",
  "stop_reason": null
}

参考
- https://github.com/BerriAI/litellm/issues/12417
  - LiteLLM 添加最后一个 chunk 的 content 为空

ToolChoice

auto
- 自动选择工具
required
- 必须使用工具
none
- 不使用工具

Thinking

https://ai.google.dev/gemini-api/docs/thinking
- https://ai.google.dev/gemini-api/docs/thinking#set-budget
- 不同模型支持逻辑不一样
- 2.5 Pro 不能关闭 128 to 32768
- 关闭 thinkingBudget = 0
- 动态 thinkingBudget = -1
- thinkingLevel
  - 默认 high
  - low, high
  - Gemini 3.0

{
  "contents": [
    {
      "parts": [
        {
          "text": "Provide a list of 3 famous physicists and their key contributions"
        }
      ]
    }
  ],
  "generationConfig": {
    "thinkingConfig": {
      "thinkingLevel": "low"
    }
  }
}

Interleaved thinking

思考过程可以进行 tool call

Claude 4+
- interleaved-thinking-2025-05-14
- Messages API 才支持
MiniMax-M2
Kimi-K2-Thinking

reasoning_details

{
  "type": "reasoning.summary",
  "summary": "The model analyzed the problem by first identifying key constraints, then evaluating possible solutions...",
  "id": "reasoning-summary-1",
  "format": "anthropic-claude-v1",
  "index": 0
}

type
- reasoning.summary
- reasoning.encrypted
- reasoning.text
维护思考细节信息
- OpenAI o
- Claude 3.7+ thinking
- Gemini Reasoning
- xAI Reasoning

https://openrouter.ai/docs/guides/best-practices/reasoning-tokens

Preserved thinking

智普 GLM 4.7 支持保留思考内容
- 再生成 chat template 的时候允许传递之前的思考内容
- 默认不保持 clear_thinking: true
- https://huggingface.co/zai-org/GLM-4.7/blob/main/chat_template.jinja
  - clear_thinking 控制是否包含 reasoning_content

https://docs.bigmodel.cn/cn/guide/capabilities/thinking-mode

role

developer
system
user
assistant
tool
- 新版本 openai
- Anthropic 使用 user role
function
- 旧版本 openai

usage

付费
- 算力
- pay per token
- pay per request
- pay per item
  - 图、语音

abort

stream 499 会产生费用
非strema 中断也会产生费用
- 极端情况会产生完整的费用
Agent 实现在中断时候需要预估 usage
- 否则 context window 会失准

Token usage unavailable during streaming abort/interruption https://github.com/vercel/ai/issues/7628

Prompt Cache

模型 / 场景	最小缓存 Token 数
Claude Opus 4.5	4096
Claude Opus 4.1, 4	1024
Claude Sonnet 4.5, 4, ~~3.7~~	1024
Claude Haiku 4.5	4096
Claude Haiku ~~3.5~~, 3	2048
Gemini 3 Pro Preview	4096
Gemini 3 Flash Preview	1024
Gemini 2.5 Pro	4096
Gemini 2.5 Flash	1024
Gemini Explicit Caching (Vertex AI)	4096
Gemini Context Caching (Early Versions)	32768
OpenAI GPT	1024

Implicit Caching: 提供 75% - 90% 的输入 Token 折扣。
Explicit Caching: 按生存时间 (TTL) 收取存储费用。
容量: 最大缓存大小等同于模型完整上下文窗口（可超过 100 万 Token）。
Gemini 3 优化: 在 Gemini 3 系列中，建议 Prompt 前缀或缓存数据至少达到 4096 Token 以确保缓存生效并有效降低 API 成本。
Google OpenAI API extra body
⚠️ Tool call 缓存实际缓存的是 schema+描述等

{
  "google": {
    "cached_content": "cachedContents/XXX",
    "thinking_config": {
      "thinking_level": "low",
      "include_thoughts": true
    }
  }
}

Gemini API

Multiple tools are supported only when they are all search tools​

OpenAI API

streaming​

first chunk​

last chunk​

ToolChoice​

Thinking​

Interleaved thinking​

reasoning_details​

Preserved thinking​

role​

usage

Prompt Cache​

Multiple tools are supported only when they are all search tools

streaming

first chunk

last chunk

ToolChoice

Thinking

Interleaved thinking

reasoning_details

Preserved thinking

role

Prompt Cache