Claude Code Guide

Model Backends

Use GLM, Kimi, and Qwen alongside your Claude MAX subscription. Same CLI, same skills, same MCP servers — different model behind the API. Delegate tasks, save tokens, open a new terminal.

Claude Code Is Model-Agnostic

Claude Code talks to an API endpoint. Override ANTHROPIC_BASE_URL and the model resolves to whatever that endpoint serves. Your skills, MCP servers, hooks, memory, and CLAUDE.md all load normally — the model is just the brain, everything else is the same.
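The override can be tried without any alias at all. A minimal sketch, reusing the z.ai endpoint that appears in the GLM alias later in this guide (the key value is a placeholder):

```shell
# One-off backend override: same claude binary, different endpoint for this
# run only. The key is a placeholder; the base URL matches the GLM alias.
ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
ANTHROPIC_API_KEY="your-glm-api-key" \
claude -p "Summarize this repo's README"
```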

Three providers offer native Anthropic-compatible API endpoints — no third-party proxy like LiteLLM or OpenRouter needed. Your requests go directly to the provider's servers, though latency is still noticeably higher than native Claude MAX (see Gotchas below):

GLM (z.ai)

GLM-5.1 on opus, GLM-4.7 on sonnet & haiku. 200K context. Coding plan. See token burn note below.

Kimi (Moonshot)

Kimi K2.5 — 262K context. Subscription plan ($30/mo). Token-efficient, comparable to Sonnet 4.5. Also ships its own CLI (kimi).

Qwen (Alibaba)

Qwen 3.6 Plus via Alibaba Cloud Coding Plan ($50/mo). Also bundles Kimi, GLM, and MiniMax models under one API key.

Your terminal — same machine, same directory, different backends

┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   claude    │ │ claude-glm  │ │ claude-kimi │ │ claude-qwen │
│  Opus 4.6   │ │   GLM-5.1   │ │  Kimi K2.5  │ │  Qwen 3.6+  │
│ 1M context  │ │    200K     │ │    262K     │ │ Coding Plan │
│  MAX plan   │ │    z.ai     │ │   $30/mo    │ │   $50/mo    │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
       └───────────────┴───────────────┴───────────────┘
              same skills, MCP, hooks, CLAUDE.md
Shell Aliases — Two Minutes to Multi-Model

Create a file that defines aliases for each backend. Source it from your shell profile so it survives reboots. Each alias launches the same claude binary with different environment variables.

~/.ai-aliases.zsh
# GLM — GLM-5.1 on opus, GLM-4.7 on sonnet/haiku/subagents
# 200K context, z.ai coding plan
alias claude-glm='ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" \
  ANTHROPIC_API_KEY="your-glm-api-key" \
  ENABLE_TOOL_SEARCH=false \
  ANTHROPIC_DEFAULT_OPUS_MODEL="GLM-5.1" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="GLM-4.7" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="GLM-4.7" \
  ANTHROPIC_DEFAULT_OPUS_MODEL_NAME="GLM-5.1" \
  ANTHROPIC_DEFAULT_SONNET_MODEL_NAME="GLM-4.7" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL_NAME="GLM-4.7" \
  CLAUDE_CODE_SUBAGENT_MODEL="GLM-4.7" \
  CLAUDE_CODE_MAX_OUTPUT_TOKENS="32000" \
  CLAUDE_CODE_AUTO_COMPACT_WINDOW="180000" \
  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
  CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS="1" \
  API_TIMEOUT_MS="3000000" \
  claude'
Kimi K2.5 alias (same file)
# Kimi K2.5 (Moonshot) — subscription plan ($30/mo), 262K context
alias claude-kimi='ANTHROPIC_BASE_URL="https://api.kimi.com/coding/" \
  ANTHROPIC_API_KEY="your-kimi-api-key" \
  ENABLE_TOOL_SEARCH=false \
  ANTHROPIC_DEFAULT_OPUS_MODEL="kimi-for-coding" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="kimi-for-coding" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="kimi-for-coding" \
  ANTHROPIC_DEFAULT_OPUS_MODEL_NAME="Kimi K2.5" \
  ANTHROPIC_DEFAULT_SONNET_MODEL_NAME="Kimi K2.5" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL_NAME="Kimi K2.5" \
  CLAUDE_CODE_SUBAGENT_MODEL="kimi-for-coding" \
  CLAUDE_CODE_MAX_OUTPUT_TOKENS="32000" \
  CLAUDE_CODE_AUTO_COMPACT_WINDOW="230000" \
  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
  CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS="1" \
  API_TIMEOUT_MS="3000000" \
  claude'
Qwen alias (same file)
# Qwen 3.6 Plus (Alibaba Cloud Coding Plan) — $50/mo, multi-model plan
# Also bundles kimi-k2.5, glm-5, MiniMax-M2.5 under the same API key
alias claude-qwen='ANTHROPIC_BASE_URL="https://coding-intl.dashscope.aliyuncs.com/apps/anthropic" \
  ANTHROPIC_API_KEY="your-alibaba-coding-plan-key" \
  ENABLE_TOOL_SEARCH=false \
  ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.6-plus" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.6-plus" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.6-plus" \
  ANTHROPIC_DEFAULT_OPUS_MODEL_NAME="Qwen 3.6+" \
  ANTHROPIC_DEFAULT_SONNET_MODEL_NAME="Qwen 3.6+" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL_NAME="Qwen 3.6+" \
  CLAUDE_CODE_SUBAGENT_MODEL="qwen3.6-plus" \
  CLAUDE_CODE_MAX_OUTPUT_TOKENS="32000" \
  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
  CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS="1" \
  API_TIMEOUT_MS="3000000" \
  claude'
Alibaba Coding Plan bundles multiple models
The $50/mo Alibaba Cloud Coding Plan API key (sk-sp-xxxxx) also works with kimi-k2.5, glm-5, qwen3-coder-plus, and MiniMax-M2.5. You can use it as an alternative provider for your Kimi or GLM aliases by swapping the base URL to coding-intl.dashscope.aliyuncs.com/apps/anthropic and the model IDs to their Alibaba-hosted equivalents. One API key, multiple backends.
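A sketch of that swap, modeled on the Kimi alias above. The kimi-k2.5 model ID is taken from the plan's bundled model list; verify it against your account before relying on it:

```shell
# Kimi K2.5 served through the Alibaba Coding Plan instead of Moonshot.
# Same pattern as the other aliases; model ID per the plan's bundled list
# (an assumption here — confirm the exact ID in your Alibaba console).
alias claude-kimi-ali='ANTHROPIC_BASE_URL="https://coding-intl.dashscope.aliyuncs.com/apps/anthropic" \
  ANTHROPIC_API_KEY="your-alibaba-coding-plan-key" \
  ENABLE_TOOL_SEARCH=false \
  ANTHROPIC_DEFAULT_OPUS_MODEL="kimi-k2.5" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="kimi-k2.5" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="kimi-k2.5" \
  CLAUDE_CODE_SUBAGENT_MODEL="kimi-k2.5" \
  claude'
```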
~/.zshrc (one line)
# Load AI aliases
[ -f "$HOME/.ai-aliases.zsh" ] && source "$HOME/.ai-aliases.zsh"

What the extra env vars do

ENABLE_TOOL_SEARCH=false
  Disables deferred tool schema loading. Required for GLM (see token burn note below); recommended for all non-Anthropic backends.

CLAUDE_CODE_SUBAGENT_MODEL
  Model used for subagent tasks. Without this, subagents may try to use the opus slot model unnecessarily.

CLAUDE_CODE_MAX_OUTPUT_TOKENS
  Caps output tokens per response. Prevents runaway generation on models without native limits.

CLAUDE_CODE_AUTO_COMPACT_WINDOW
  Token count at which auto-compaction triggers. Set it below the model's actual context limit to compact before hitting the wall.

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC
  Disables telemetry and analytics calls that consume tokens and add latency on non-Anthropic backends.

CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
  Disables experimental features that may not work correctly with non-Anthropic API endpoints.

Authentication precedence

Claude Code checks credentials in this order: ANTHROPIC_AUTH_TOKEN > ANTHROPIC_API_KEY > OAuth login. Either variable works in your aliases — ANTHROPIC_API_KEY is sufficient and avoids overwriting the OAuth token. You'll see a one-time auth prompt on first launch; approve it and the session uses the API key cleanly.

Shell environment variables always override settings.json. No config file conflicts between backends — your regular claude command keeps using MAX, and each alias overrides cleanly.
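The precedence order is easy to sanity-check before launching. This helper is not part of Claude Code, just a local sketch that mirrors the documented order:

```shell
# Local helper (not a Claude Code feature): report which credential source
# would win under the documented precedence, and which base URL is in effect.
check_claude_env() {
  if [ -n "$ANTHROPIC_AUTH_TOKEN" ]; then
    echo "credential: ANTHROPIC_AUTH_TOKEN (highest precedence)"
  elif [ -n "$ANTHROPIC_API_KEY" ]; then
    echo "credential: ANTHROPIC_API_KEY"
  else
    echo "credential: OAuth login (default)"
  fi
  echo "base URL:   ${ANTHROPIC_BASE_URL:-https://api.anthropic.com (default)}"
}
```

Run it in the same shell where you would launch an alias to confirm which backend and credential that session will pick up.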

First launch
The first time you run a backend alias, Claude Code may ask "Do you want to use this API key?" — say Yes. The choice is remembered for that session. If you see an auth conflict warning, it's safe to proceed — the API key takes precedence.
The Model Matrix

Each backend has different strengths, context limits, and cost profiles. Pick the right one for the task.

Alias        Opus Slot  Sonnet Slot  Haiku Slot  Context  Best for
claude       Opus 4.6   Sonnet 4.6   Haiku 4.5   1M       Everything — primary driver
claude-glm   GLM-5.1    GLM-4.7      GLM-4.7     200K     Scaffolding, PR workflows, prototyping
claude-kimi  Kimi K2.5  Kimi K2.5    Kimi K2.5   262K     Coding, task delegation — token-efficient, comparable to Sonnet 4.5
claude-qwen  Qwen 3.6+  Qwen 3.6+    Qwen 3.6+   TBD      Alibaba Coding Plan — also bundles Kimi, GLM, MiniMax
Context limits matter
A fully-loaded Claude Code session (CLAUDE.md + skills + MCP server prompts + hooks + memory) consumes ~82K tokens of system prompt. GLM (200K) and Kimi (262K) have enough headroom for the full stack. Models with smaller context windows may need --bare mode.
The -p Killer Feature

Non-interactive mode (claude -p) works with any backend alias. Pipe a task in, get results out. Delegate batch work to alternative backends to save your MAX tokens for complex interactive sessions.

# Code review on GLM (coding plan)
git diff | claude-glm -p "Review this diff for bugs and security issues"

# File analysis on Kimi tokens (subscription)
claude-kimi -p "Explain the architecture of src/auth/" \
  --allowedTools "Read,Grep,Glob"

# Structured output for CI pipelines
claude-glm -p "List all TODO comments in this repo" \
  --output-format json \
  --allowedTools "Grep"

Cross-backend delegation from an active session

Inside any Claude Code session, you can shell out to another backend using the ! prefix:

# From a GLM session, delegate a complex task to Opus
! claude -p "Analyze the security of src/auth.ts" --allowedTools "Read,Grep"

# From a Claude MAX session, hand off a review to Kimi
! git diff | claude-kimi -p "Review for OWASP top 10 issues"
GLM-5.1 Builds a Game in One Shot

The first task given to GLM-5.1 through Claude Code was a natural language game concept — a top-down zombie survival shooter inspired by Alien Swarm and Shaun of the Dead. The prompt was a casual brain dump, not a structured spec.

6 sentences in the prompt · 889 lines of JavaScript · 0 errors on first run
GLM-5.1 generated a fully playable Phaser 3 zombie survival game from a 6-sentence prompt

What GLM produced

All MCP servers (code search, memory, file operations) worked correctly throughout the session. GLM used context7 to look up Phaser 3 documentation, Read/Write/Edit for file operations, and ran the same /wrap end-of-session workflow that Claude does. Tool use reliability was indistinguishable from a native Claude session for this task.

Zero MAX token burn
This entire session — from concept to playable game — ran on GLM's coding plan. No Claude MAX tokens consumed. Same skills, same MCP servers, same workflow.
GLM Handling Real PR Workflows

Beyond prototyping — GLM-5.1 running /pr-resolve on a real pull request. Discovering review threads, creating patches, pushing branches, managing git worktrees. Full MCP tool chain: GitHub API, file operations, git commands.

Split terminal: GLM-5.1 running pr-resolve on top, Claude Sonnet handling StakTrakr work on bottom — two models, two projects, parallel execution

Top pane: GLM resolving PR review threads — reading comments, patching code, pushing fixes. Bottom pane: Claude Sonnet running Playwright tests and managing worktrees on a different project. Two models, two projects, running simultaneously on the same machine.

This is the real benchmark
Game prototypes are fun, but PR resolution is production work. GLM navigating git branch, gh pr create, file edits, and GitHub MCP tools without supervision — on a separate coding plan that doesn't touch your MAX quota — is the data point that matters.
Kimi K2.5 Tackles a Side-Scroller

Same night, harder prompt. Kimi K2.5 was given a similar brain dump but asked for a side-scrolling shooter (Metal Slug style) instead of top-down. Side-scrollers are mechanically more complex — gravity, jump arcs, platform collision, directional shooting, parallax scrolling.

1,827 lines of JavaScript · 3 rounds to working build · 262K context window
Kimi K2.5 generated a side-scrolling Phaser 3 zombie shooter with platforming, gravity, and wave-based enemies

Honest comparison

GLM shipped a working top-down game in one shot. Kimi needed three debugging rounds for the harder side-scroller. The two results aren't directly comparable — the prompts differed and so did the complexity. A fair benchmark would use identical specs, which is planned for a future test.

What both results prove: alternative models can navigate Claude Code's full tool chain — file creation, MCP servers, Phaser documentation lookup, multi-file editing — and produce working software. The quality gap between models shows up in edge cases and debugging resilience, not in basic capability.

Prototype, not production
The Kimi game is a rough prototype — it needs real level design and gameplay tuning. But as a proof of concept for Kimi K2.5 running through Claude Code's workflow, it demonstrates that the 262K context model can handle multi-file Phaser projects with full MCP integration.
Gotchas & Limitations

GLM token burn: ToolSearch incompatibility

Claude Code's deferred tool loading feature (ENABLE_TOOL_SEARCH, enabled by default) lets the model lazily fetch tool schemas on demand. GLM does not handle this flow natively — it loops on ToolSearch calls, burning millions of tokens on repeated schema fetches instead of resolving them. The fix is ENABLE_TOOL_SEARCH=false in your alias, which dumps all tool schemas upfront. The tradeoff: 60-70% of your context window is consumed at session start by ~100 MCP tool schemas. To reduce this, disable MCP servers you don't need for the task. This is a Claude Code compatibility issue with non-Anthropic backends, not a GLM API limitation — it may be resolved in a future Claude Code update.

Watch your token usage
If you forget ENABLE_TOOL_SEARCH=false in your GLM alias, you may see 12M+ tokens consumed in a session that should use ~300K. The symptom is rapid auto-compaction and sluggish responses. Add the flag, and consider trimming your MCP server list for GLM sessions.

Context limits are the real constraint

A fully-loaded Claude Code session (~82K tokens of system prompt) needs a model with enough context headroom. GLM (200K) and Kimi (262K) handle this fine. Models with smaller windows may need --bare mode, which strips plugins, skills, and MCP servers.

No mid-session switching

You cannot change backends during a conversation. Each terminal session is locked to whatever backend it launched with. To use a different model, open a new terminal tab.
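The flip side: since every session is independent, backends run happily side by side. A sketch using the aliases above from an interactive shell:

```shell
# Fan the same diff out to two backends at once; each alias launches its own
# independent process, so the reviews run in parallel.
git diff > /tmp/review.diff
claude-glm  -p "Review this diff for bugs" < /tmp/review.diff > /tmp/glm-review.md  &
claude-kimi -p "Review this diff for bugs" < /tmp/review.diff > /tmp/kimi-review.md &
wait  # both reviews are now on disk; compare them however you like
```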

Latency

Alternative backends talk directly to the provider's API (no third-party proxy), but expect noticeably slower first-token and streaming response times than native Claude MAX, which benefits from Anthropic's optimized serving infrastructure and prompt caching.

Tool use reliability varies

Claude is purpose-built for Claude Code's tool calling patterns. Alternative models may struggle with complex multi-step tool chains (Read → Edit → Grep → Bash sequences). GLM-5.1 tested well on a substantial scaffolding task, but long-session reliability for all backends is still being evaluated.

Auth conflict on first launch

If you're logged into claude.ai (MAX subscription), launching an alias will show a one-time warning about conflicting credentials. Approve the API key when prompted — it takes precedence for that session. This is cosmetic; both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN work correctly.

OpenCode — All Models, One CLI

Shell aliases give you one model per terminal. OpenCode (open source, 75+ providers) lets you switch models mid-session with Ctrl+O. It reads your CLAUDE.md and ~/.claude/skills/ natively — same skills, same MCP servers, different UI.

Runtime switching

Configure all providers once. Pick the model per task at runtime — no alias dance, no new terminal.

Reads Claude skills

OpenCode discovers ~/.claude/skills/ automatically. All 88 skills work without migration.

Same MCP stack

Configure ~/.config/opencode/opencode.json with the same MCP servers. Slightly different JSON format.

Install
brew install anomalyco/tap/opencode — then create ~/.config/opencode/opencode.json with your MCP servers and provider keys. Run opencode mcp list to verify.
Not Just Claude Code

These API keys work anywhere. The shell aliases and dedicated CLIs are the convenience layer — not the only integration point.

Direct REST

curl to any endpoint for scripted automation, CI pipelines, or custom tooling.
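A minimal sketch against the GLM endpoint, assuming it follows the standard Anthropic Messages API shape (/v1/messages path, x-api-key and anthropic-version headers); verify the exact path against z.ai's documentation:

```shell
# Raw Messages API call, no CLI involved. The path and headers follow the
# standard Anthropic API shape — an assumption here, not z.ai's documented spec.
curl -s "https://api.z.ai/api/anthropic/v1/messages" \
  -H "x-api-key: $GLM_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model": "GLM-5.1", "max_tokens": 128, "messages": [{"role": "user", "content": "One-line summary of HTTP/2"}]}'
```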

OpenAI SDKs

Kimi and other providers expose OpenAI-compatible endpoints. Drop them into any app that uses the OpenAI SDK.
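For instance, a raw chat-completions call to Kimi might look like this; the base URL and model ID are assumptions, so check Moonshot's documentation for the authoritative values in your region:

```shell
# OpenAI-compatible call to Kimi. Endpoint and model ID are assumptions;
# Moonshot documents the real values (they differ by region).
curl -s "https://api.moonshot.ai/v1/chat/completions" \
  -H "Authorization: Bearer $KIMI_API_KEY" \
  -H "content-type: application/json" \
  -d '{"model": "kimi-k2.5", "messages": [{"role": "user", "content": "hello"}]}'
```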

MCP Servers

Any MCP server that accepts an API key and base URL can be pointed at these providers.

OpenRouter

580+ models through a single API key. Useful for benchmarking across many models without signing up for each provider.

The bigger picture
This setup is one piece of a multi-agent workflow that dispatches different models for different cognitive tasks. Claude MAX handles complex architecture. GLM handles scaffolding and prototyping. Kimi handles coding and delegation (with its own CLI too). Qwen adds another option via the Alibaba Coding Plan. Codex and Gemini run as peer CLI agents. The spec documents are the shared protocol — any agent can pick up a task from any other.