
Open-Weight Coding LLMs in 2026: A Task-Type Playbook for Kimi K2.5, GLM-5, and When to Pay for Claude

Dmytro Chaban
May 19, 2026 · Updated May 19, 2026 · 24 min read

Every post about open-weight coding models opens the same way: benchmark table, price-per-token chart, "Kimi wins" or "GLM-5 wins" headline. That framing has cost me real time. The model that wins on SWE-bench Verified is not the model I'd hand a 4-hour multi-file refactor to, and the model with the cheapest input tokens is often the most expensive one once I count the failed runs.

The question that actually matters when you sit down to work is narrower: which task gets routed where?

This is the playbook I use to make that decision in 2026, across Kimi K2.5, GLM-5, and Claude as the escalation tier. Task-type first, model second. Treat it as a team policy rather than a personal preference and the "which model should we use?" debate mostly evaporates.

The three contenders, one paragraph each

Kimi K2.5 is Moonshot AI's 1-trillion-parameter MoE (32B active) released January 2026. The thing that matters is the training method — PARL (Parallel-Agent Reinforcement Learning) — which teaches an orchestrator to decompose tasks and delegate to frozen sub-agent copies running in parallel. API pricing on Moonshot is $0.60 per million input tokens and $3.00 per million output, with a 256K context window. For coding workflows, the sweet spot is a Synthetic.new subscription at $60/month, which gives you significantly higher request limits than Claude's $200 tier.

GLM-5 is Zhipu AI's 744B parameter MoE (40B active) released February 12, 2026, under an MIT license. Pricing on z.ai is $0.80 / $2.56 per million tokens — competitive on paper, with a 200K context. The MIT license is the real differentiator: you can self-host, fine-tune, and ship it in a product without negotiating with anyone. On benchmarks it edges Kimi (SWE-bench Verified 77.8% vs 76.8%, Intelligence Index 50 vs 47). In real agentic loops it has problems I'll show you how to reproduce.

Claude Sonnet 4.6 isn't on this playbook as a budget option. It's the reference ceiling — $200/month for unlimited Claude Code, $3/$15 per million tokens via API, Intelligence Index 51. When a budget model can't close a task, this is the model you escalate to. The reason isn't raw intelligence; it's the surrounding ecosystem — native tool routing, MCP support, persistent project memory, extended thinking. None of the open-weight options replicate that.

The four task buckets

I sort almost every coding task I do into one of four buckets. The model assignment falls out of the bucket, not the other way around.

| Bucket | What it looks like | Route to | Reason |
| --- | --- | --- | --- |
| A | Mechanical refactors, test expansion, scaffolding | Kimi K2.5 | Fast, structured, validation is cheap |
| B | Multi-step agentic workflows (research + execute) | Kimi K2.5 (PARL swarm) | Parallel sub-agents do what GLM-5 can't |
| C | Self-hosted / fine-tuned deployments | GLM-5 | MIT license; the only one you actually own |
| D | Ambiguous architecture, risk-heavy patches | Claude Sonnet 4.6 | Failure cost > token cost |
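If you want the table to live in code as a team policy rather than in someone's head, one minimal sketch is a lookup keyed by bucket. The model labels below are hypothetical placeholders, not provider identifiers; substitute whatever names your harness uses.

```python
from enum import Enum


class Bucket(Enum):
    MECHANICAL = "A"   # refactors, test expansion, scaffolding
    AGENTIC = "B"      # multi-step research + execute workflows
    SELF_HOSTED = "C"  # self-hosted or fine-tuned deployments
    RISK_HEAVY = "D"   # ambiguous architecture, risk-heavy patches


# Hypothetical model labels (assumption): replace with your provider's identifiers.
ROUTES = {
    Bucket.MECHANICAL: "kimi-k2.5",
    Bucket.AGENTIC: "kimi-k2.5-parl-swarm",
    Bucket.SELF_HOSTED: "glm-5",
    Bucket.RISK_HEAVY: "claude-sonnet-4.6",
}


def route(bucket: Bucket) -> str:
    """Return the model label assigned to a task bucket."""
    return ROUTES[bucket]


print(route(Bucket.AGENTIC))
```

Keeping the mapping as data means a new model launch changes one line of the policy, not every call site.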

The rest of this piece is how I figured those out, what breaks the rules, and how to ladder the spend.

Task A: Mechanical refactors, test expansion, scaffolding

Route to: Kimi K2.5

This is the bucket where Kimi earns its keep. The shape of the work is "I already know what the answer looks like, I just need a draft fast enough that I'm reviewing instead of typing."

Concretely:

  • Rename patterns across many files.
  • Interface alignment between modules after a contract change.
  • Test suite expansion for known behavior — regression tests, edge-case coverage around an existing function, fixture variations.
  • Integration scaffolding — typed adapters, wrappers, migration shims, first-pass changelogs.
  • Pure structural moves — extract function, split file, lift a class into its own module.

The reason Kimi wins here is not raw intelligence. It's that iteration speed beats first-pass elegance when validation is straightforward. If a generated diff is wrong, the test suite or the type checker tells me in seconds and I re-prompt. The penalty for a bad pass is tiny. Claude's quality advantage doesn't compound on tasks where I'd reject 5% of either model's output anyway.

What this looks like on the bill: on Moonshot's API, a typical day of refactor work for me runs maybe 3M input and 800K output tokens — about $4 in API spend. The same workload through Claude Sonnet 4.6 ($3/$15 per million) is around $21. The Synthetic.new $60/month subscription removes rate limits as a concern entirely, which is what I actually want for batched mechanical work.
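The arithmetic behind those two figures is easy to check. The sketch below uses only the per-million-token prices and the daily token volume quoted above; the dictionary keys are illustrative labels, not API model names.

```python
# Per-million-token prices (USD) as quoted in this post.
PRICES = {
    "kimi-k2.5": {"input": 0.60, "output": 3.00},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}


def daily_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """API spend in USD for one day's token volume."""
    price = PRICES[model]
    return (input_tokens / 1e6) * price["input"] + (output_tokens / 1e6) * price["output"]


# 3M input and 800K output tokens, the refactor day described above.
kimi = daily_cost("kimi-k2.5", 3_000_000, 800_000)
claude = daily_cost("claude-sonnet-4.6", 3_000_000, 800_000)
print(f"Kimi: ${kimi:.2f}  Claude: ${claude:.2f}  ratio: {claude / kimi:.1f}x")
```

That works out to $4.20 against $21.00, a 5x gap on identical token volume.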

Switch off Kimi when: you hit subtle regressions you can't trace, or the "mechanical" refactor turns out to have a load-bearing assumption you didn't notice. That's no longer Task A — it's Task D.

Task B: Multi-step agentic workflows

Route to: Kimi K2.5 with the PARL swarm enabled

This is the bucket that budget-model comparison posts get wrong most often. People test "Kimi vs GLM-5" by handing each one a single-shot prompt and comparing the output. Agentic work isn't single-shot. It's an orchestrator that needs to gather context from three or four places, make tool calls, observe the results, debug its own errors, and keep coherent over 50+ steps.

This is exactly where GLM-5 falls down, and exactly where PARL is built to win.

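The Moonshot numbers tell the story: with the swarm enabled, research-heavy tasks finish 3-4.5x faster and BrowseComp rises from 60.6% to 78.4%.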

The mechanism, stripped of jargon: a trainable orchestrator learns to decompose a task and delegate to frozen copies of itself running in parallel. Only the orchestrator gets RL updates. Instead of one agent doing 50 sequential tool calls, the orchestrator dispatches 4 sub-agents to do 12 tool calls each in parallel, then synthesizes and continues. That's the whole trick. The "swarm" name oversells it, but the speedup is real on research-heavy tasks where the bottleneck is wall-clock time on tool calls.

In two weeks of OpenClaw testing earlier this year I ran the same multi-file refactor tasks against both Kimi K2.5 and GLM-5 with identical configs. Representative example: a refactor where changes in one file required corresponding changes in 3+ other files. Kimi handled it in 12 minutes with clean output. GLM-5 fixed file A, then fixed file B incorrectly, then tried to fix file A again, and looped. After 45 minutes the task was incomplete.

That's the bucket. If the work involves real agentic execution — multi-tool, multi-step, with intermediate state — Kimi is the budget pick that actually finishes the run.

Task C: Self-hosted or fine-tuned deployments

Route to: GLM-5

This is the only bucket where I'd choose GLM-5 over Kimi. Not because GLM-5 is better at coding — it's competitive on benchmarks but loses on agentic reliability — but because GLM-5 is the only one of the three you actually own.

The MIT license is real. Weights are on HuggingFace and ModelScope, there are no usage restrictions, no rate limits other than your own hardware, no provider that can deprecate you. For a few specific situations this matters more than benchmark scores:

  • You're shipping a product where the model must run on customer infrastructure (regulated industries, air-gapped environments, EU data residency without an EU API endpoint).
  • You need to fine-tune on proprietary code that you can't send to a third-party provider.
  • Your unit economics require fixed per-hour GPU cost, not per-token API cost, because your traffic pattern is steady and high.
  • You want to fork the model and ship a quantized variant for edge inference.

At 744B total / 40B active, full-precision inference is not a hobbyist deployment — you're looking at a multi-GPU setup. But Unsloth quantized versions can run on consumer hardware with enough RAM, and that's plenty for a controlled production workload.

Kimi K2.5's weights are also published, under a modified MIT license, but at 1T total / 32B active the practical self-host story is harder. Claude is closed. For "I need to deploy a capable coding model into my own infra and call it mine," GLM-5 wins the bucket cleanly.

Task D: Ambiguous architecture, risk-heavy patches

Route to: Claude Sonnet 4.6

This is the escalation tier. The defining feature of Task D is that the failure cost is higher than the token cost by at least an order of magnitude.

Concretely:

  • A production hotfix where a bad patch causes a customer-facing outage.
  • An architecture decision that creates months of maintenance debt if interpreted wrong.
  • Debugging a race condition or memory leak where "almost correct" is still broken.
  • A change in a large unfamiliar codebase where the model needs to surface implicit assumptions before acting.
  • Anything where the fix needs to be reviewed by another senior engineer and they're going to ask "why" — Claude is materially better at building that "why" into the change.

The pattern I see consistently: on Task A and B, Claude is 10-20% better quality than Kimi at 5-8x the price, which is a bad trade. On Task D, Claude is 40-50% better and the work is too important to take the Kimi penalty. The math flips entirely.

The other piece of Task D is the Claude Code environment. For risk-heavy work, the persistent project memory, the extended thinking mode, the native MCP integrations, the way it asks clarifying questions before acting on an ambiguous request — none of that exists in the open-weight world. You can route a Kimi or GLM-5 backend through Claude Code, but as one r/opencodeCLI commenter put it: "you lose all the integrated features Claude Code gives you. You get a raw API call wearing a familiar UI."

For Task D I want the full stack, not a partial one.

The GLM-5 loop failure: a reproducible recipe

The most important thing to know about GLM-5 before you put it behind an agent is the loop-failure mode. It's not subtle and you can hit it deliberately.

Setup to reproduce:

  1. Use OpenCode or OpenClaw with GLM-5 via z.ai as the backend.
  2. Give it a multi-file refactor task where changes in file A force corresponding changes in files B, C, and D — and one of those changes affects how A consumes its own output.
  3. Include a TDD-style prompt: "Run the failing tests first, fix until they pass, then run the full suite to verify no regressions."
  4. Walk away.

What you'll come back to is a model that has fixed A, then fixed B incorrectly because it lost track of the contract, then "fixed" A again in a way that broke C, then attempted to fix C in a way that re-broke B. There is no escape state in its planning loop. The Reddit r/opencodeCLI thread on this had it nailed: "when I provide them a plan with a TDD suite, their reliability becomes a problem. They can finish simple tasks, but anything with a more complex logic puts them in a loop with no escape."

The technical reason, as best I can tell: GLM-5's training optimized hard for single-step completion quality (what the benchmarks measure) and didn't sufficiently penalize the failure mode where the model commits to a plan, observes that step 2 broke step 1, and then patches step 1 in a way that breaks step 3. It lacks the "stop, replan, ask for input" instinct Claude has and Kimi mostly has through PARL's orchestrator.
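Since the model won't stop itself, the harness can. A minimal sketch of a harness-side guard, assuming you can snapshot the touched files after each agent step: hash the snapshot and flag the run when the workspace returns to a state it has already been in. The `LoopDetector` class and its `observe` method are hypothetical names, not part of OpenCode, OpenClaw, or any GLM-5 tooling.

```python
import hashlib


class LoopDetector:
    """Flags a run when the workspace returns to a state it has already been in."""

    def __init__(self, max_repeats: int = 1):
        # How many times a given state may be seen before it counts as a loop.
        self.max_repeats = max_repeats
        self._seen: dict[str, int] = {}

    def observe(self, files: dict[str, str]) -> bool:
        """Record a snapshot of {path: contents}; return True if a loop is suspected."""
        digest = hashlib.sha256()
        for path in sorted(files):  # sorted so dict ordering does not change the hash
            digest.update(path.encode())
            digest.update(b"\0")
            digest.update(files[path].encode())
            digest.update(b"\0")
        key = digest.hexdigest()
        self._seen[key] = self._seen.get(key, 0) + 1
        return self._seen[key] > self.max_repeats


detector = LoopDetector()
first = detector.observe({"a.py": "v1", "b.py": "v1"})     # new state
second = detector.observe({"a.py": "v2", "b.py": "v1"})    # new state
reverted = detector.observe({"a.py": "v1", "b.py": "v1"})  # same as the first snapshot
print(first, second, reverted)
```

This only catches exact reverts. An agent that oscillates between near-identical states would need a looser signal, such as counting how often the same test flips between pass and fail.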

Throughput collapse compounds the problem. GLM-5's spec is 70 tokens/second. During peak hours on z.ai, the r/opencodeCLI community has consistently measured it at 15-25 tokens/second. For an agentic workflow doing 200+ tool calls, this compounds into many minutes of added latency. A 45-minute loop failure on z.ai is a 15-minute loop failure on a faster provider — still a loop failure.
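To put a rough number on the added latency, here is a back-of-the-envelope sketch. The 200 tool calls and the 70 and 15-25 tokens/second figures come from the paragraph above; the 300 generated tokens per call is purely an assumption for illustration, so swap in your own measurement.

```python
def generation_minutes(tool_calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Minutes spent purely on token generation across a run."""
    return tool_calls * tokens_per_call / tokens_per_second / 60


CALLS = 200            # "200+ tool calls" from the text
TOKENS_PER_CALL = 300  # assumption, not a measured value

at_spec = generation_minutes(CALLS, TOKENS_PER_CALL, 70)
at_peak = generation_minutes(CALLS, TOKENS_PER_CALL, 20)  # midpoint of 15-25 tok/s
print(f"spec: {at_spec:.1f} min, peak: {at_peak:.1f} min, added: {at_peak - at_spec:.1f} min")
```

Under that assumption, the same run takes about 14 minutes at spec and 50 minutes at peak-hour throughput, before any loop failure is involved.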

SiliconFlow is the most commonly recommended alternative provider — faster, slightly cheaper, OpenAI-compatible endpoint. It removes the throughput problem but not the loop-planning problem. If you're committed to GLM-5 in production, SiliconFlow is the right hosting choice. But don't expect a provider switch to fix the agentic reliability issue. That's in the model.

PARL swarm without the jargon

Strip "Parallel-Agent Reinforcement Learning" down to what it actually is and the value is easier to evaluate.

A normal agent loop looks like this: prompt → think → call tool → observe → think → call tool → observe → ... for 50 or 100 steps. Each tool call is sequential. If a research task needs you to read 4 documents before deciding what to do next, you wait for all 4 reads in order.

PARL changes the loop to: prompt → orchestrator decides "I need to read these 4 documents in parallel" → 4 frozen sub-agents each read one doc → orchestrator synthesizes results → continues. Same total work, ~4x less wall-clock time on the parallelizable parts.
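The scheduling difference between the two loops can be sketched with plain `asyncio`. This illustrates only the fan-out shape, not PARL itself: there is no trained orchestrator here, and `read_document` is a hypothetical stand-in that sleeps instead of calling a real tool.

```python
import asyncio
import time


async def read_document(name: str, delay: float = 0.05) -> str:
    """Stand-in for a slow tool call (hypothetical; sleeps instead of doing I/O)."""
    await asyncio.sleep(delay)
    return f"summary of {name}"


async def sequential(docs: list[str]) -> list[str]:
    # Classic agent loop: each read waits for the previous one to finish.
    return [await read_document(d) for d in docs]


async def fan_out(docs: list[str]) -> list[str]:
    # Orchestrator-style dispatch: start all reads, then gather the results in order.
    return list(await asyncio.gather(*(read_document(d) for d in docs)))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


docs = ["a.md", "b.md", "c.md", "d.md"]
seq_result, seq_time = timed(sequential(docs))
par_result, par_time = timed(fan_out(docs))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Both paths return the same summaries. The parallel one finishes in roughly the time of a single read, which is where the ~4x figure for four documents comes from.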

The "trainable orchestrator + frozen sub-agents" framing matters because it explains why this isn't just multi-threading. During RL training, only the orchestrator's parameters get updated — it learns when to decompose, how to slice the work, which tools to dispatch to which sub-agent, and how to merge the answers. The sub-agents are deliberately frozen copies of the base model, so they're not getting cleverer over time — they're just dependable workhorses.

The practical effect:

  • Research-heavy tasks finish 3-4.5x faster. This is the biggest single workflow win in any of the budget models right now.
  • Multi-source synthesis is more accurate — BrowseComp went from 60.6% to 78.4% with the swarm enabled. The orchestrator catches contradictions between sub-agent outputs that a single-agent run wouldn't even surface.
  • Cost stays approximately flat — you're paying for the same total tool calls, just in parallel.
  • Output coherence stays high because the orchestrator is doing the synthesis, not stitching outputs from N independent runs.

The cost: PARL is built into Kimi K2.5's training. You can't bolt it onto GLM-5 or Claude. So this is a real, durable advantage of Kimi for as long as it remains the only model with a swarm-trained orchestrator.

Cost math by task type, not by month

The usual cost math is "here's the per-million-token price, multiply by estimated usage, compare to Claude's $200 sub." That math is wrong in a specific way that costs people money.

It misses rework overhead — the multiplier on your token spend caused by failed runs that have to be redone. Rework is the single biggest variable in real-world model economics, and it's task-type dependent.

A useful framing per bucket:

| Bucket | Kimi K2.5 cost | GLM-5 cost | Claude cost | Effective winner |
| --- | --- | --- | --- | --- |
| A (Mechanical) | ~$4-8/day API + low rework | ~$3-6/day API + low rework | ~$20-30/day API | Kimi (close to GLM-5 but more reliable) |
| B (Agentic) | ~$10-15/day on PARL + low rework | ~$8-12/day API + 1.5-2x rework | $200/mo subscription | Kimi (effective cost drops below GLM-5 after rework) |
| C (Self-hosted) | N/A practically | Fixed GPU $/hr + zero per-token | N/A | GLM-5 |
| D (Risk-heavy) | ~$10-15/day + high rework when it fails | ~$8-12/day + very high rework | $200/mo subscription | Claude (rework overhead on the budget models exceeds the Claude premium) |

The number that gets lost is the dollar value of your time. One 45-minute loop failure on GLM-5 — the kind I described above — costs ~33 minutes more than a 12-minute Kimi run would have. At $100/hour fully-loaded developer cost, that's $55 of productivity gone. That's more than two months of the price difference between GLM-5 and Kimi K2.5 on a typical workload. Token-price comparisons never include this.

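The actionable rule: price by completed task, not by per-million-token rate. A model that costs 2x more per token but completes the task in 1 attempt instead of 1.7 is about 18% more expensive on tokens alone (2.0 vs 1.7 units of spend), and it becomes the cheaper option as soon as the developer time lost to failed attempts is counted. This is where Kimi quietly wins on Task B and Claude quietly wins on Task D.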
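A minimal sketch of pricing by completed task, under stated assumptions: the per-attempt token costs ($1.00 and $2.00) and the 10 minutes lost per failed attempt are illustrative numbers, not measurements. The $100/hour rate is the one used earlier in this section.

```python
def cost_per_completed_task(
    token_cost_per_attempt: float,
    attempts: float,
    minutes_per_failed_attempt: float = 0.0,
    dev_rate_per_hour: float = 0.0,
) -> float:
    """Expected USD per finished task, counting reruns and time lost to failed attempts."""
    token_spend = token_cost_per_attempt * attempts
    failed_attempts = max(attempts - 1.0, 0.0)
    time_spend = failed_attempts * minutes_per_failed_attempt / 60 * dev_rate_per_hour
    return token_spend + time_spend


# Tokens only: the cheaper model still wins (1.70 vs 2.00).
cheap_tokens_only = cost_per_completed_task(1.00, 1.7)
pricey_tokens_only = cost_per_completed_task(2.00, 1.0)

# Add 10 minutes lost per failed attempt at $100/hour and the ranking flips.
cheap_with_time = cost_per_completed_task(1.00, 1.7, 10, 100)
pricey_with_time = cost_per_completed_task(2.00, 1.0, 10, 100)

print(f"tokens only: {cheap_tokens_only:.2f} vs {pricey_tokens_only:.2f}")
print(f"with time:   {cheap_with_time:.2f} vs {pricey_with_time:.2f}")
```

The point of the sketch is the second comparison: rework time, not token price, is usually the term that decides the winner.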

Provider matrix, with the failure modes

The "which provider should I use for Kimi or GLM-5?" question is real and it changes the cost math. Here's the matrix I keep in my head:

Moonshot (Kimi K2.5 — official): $0.60 input / $3.00 output per million tokens. The reference. Reliable, good throughput, official PARL swarm support. Use when you want pay-as-you-go API access and don't need a subscription. The dashboard is slightly clunky for non-Chinese-speaking users but workable.

Synthetic.new (Kimi K2.5 and others): $60/month Pro tier, $20/month Standard. The Pro tier is the sweet spot: significantly higher request volume than Claude's $200 tier, comparable throughput to Moonshot's API, fixed monthly cost. Use this when your workload is heavy enough that you'd rather not watch a per-token meter. The Standard tier ($20) is fine for light use, but its limits arrive quickly; if you keep hitting them, Moonshot's pay-as-you-go API may be the better fit for the same workload.

SiliconFlow (GLM-5 and Kimi): OpenAI-compatible endpoint, faster than z.ai for GLM-5 specifically, generally competitive pricing. Use when committed to GLM-5 in production. Failure mode: it's a third-party relay, so you're stacking another provider's reliability on top of the model's. For Kimi, Moonshot or Synthetic are usually better.

OpenRouter (everything): OpenAI-compatible API in front of dozens of model providers. Pricing varies by underlying provider and OpenRouter takes a small margin. Use when you want to route the same agent across multiple backends without rewriting integration code. Failure mode: when a provider has an outage, OpenRouter's failover isn't always graceful, and per-provider rate limits still apply.

z.ai (GLM-5): The official source. Cheap on paper, slow in practice. Use only if you specifically need direct access to Zhipu's latest model variants the day they ship.

Claude API + Claude Code: Not in the budget tier, but worth naming for completeness. $200/month subscription = unlimited Claude Code, $3/$15 per million tokens for raw API. Use as the escalation tier for Task D. The math is covered in detail in my best-value LLM subscriptions guide.

When not to use a budget model

This is the point most budget-model posts gesture at without ever quite saying out loud. There are workloads where running a budget open-weight model is materially worse than running nothing.

A short checklist. If any of these are true, route to Claude (or another premium model) regardless of the per-token cost:

  • The change will be merged and deployed without a senior engineer reviewing it.
  • The code path is exercised by customer-facing requests and a regression has a financial cost.
  • The task involves auth, billing, payments, PII handling, or any compliance surface.
  • You can't easily roll back if the change is wrong (e.g., it's a database migration).
  • The work requires understanding why something is the way it is (architectural decision history, "Chesterton's fence" patches) before changing it.
  • You're handing the agent a long-running autonomous loop with no human checkpoint.
  • The output is a contract — an API definition, a published schema, a config file other systems depend on.
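The checklist above can be encoded as a pre-flight gate in an agent harness. This is a sketch with hypothetical field names that mirror the bullets one-to-one; how each flag gets set (manually, from labels, from path rules) is left open.

```python
from dataclasses import dataclass


@dataclass
class TaskFlags:
    """Hypothetical flags mirroring the checklist; names are illustrative."""
    unreviewed_deploy: bool = False     # merged and deployed without senior review
    customer_facing_path: bool = False  # a regression has a financial cost
    compliance_surface: bool = False    # auth, billing, payments, PII
    hard_to_roll_back: bool = False     # e.g. a database migration
    needs_design_history: bool = False  # "Chesterton's fence" changes
    unattended_loop: bool = False       # long autonomous run, no human checkpoint
    output_is_contract: bool = False    # API definition, schema, shared config


def must_escalate(flags: TaskFlags) -> bool:
    """True if any checklist item applies, regardless of per-token cost."""
    return any(vars(flags).values())


print(must_escalate(TaskFlags()))
print(must_escalate(TaskFlags(hard_to_roll_back=True)))
```

A single true flag is enough to escalate, which matches the "if any of these are true" wording of the checklist.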

Most of those reduce to one principle: budget models are great executors and unreliable judges. When the task is "do this clearly-specified thing fast and cheap," they win. When the task contains a hidden judgment call — should I do this at all, is this assumption safe, will this scale, who else depends on this — they don't reliably catch it, and the cost of a missed judgment call is asymmetric.

This is also why running a budget model behind an autonomous agent with no review loop is the most common way I see teams burn money on cheap LLMs. The token bill stays low. The cleanup bill is invisible until it's not.

The migration ladder

How I'd ladder open-weight adoption if I were starting fresh today:

Step 1 — Claude Pro at $20. Use it for two weeks and find out where you actually hit the Opus limits. (You will hit them.) The reason to start here is that Claude is the reference quality bar — you want to know what "good" looks like before you start trading it away for cost savings.

Step 2 — Add Synthetic.new at $60. Route Task A and Task B work through Kimi K2.5 via Synthetic. Keep Claude Pro for Task D. This is the highest-leverage move on the ladder: you offload 60-70% of token volume to a model that's good enough on the routed work, and you keep premium quality for the work where it matters. Total spend: $80/month for capacity that exceeds Claude Max 5x ($100/month) on volume and approaches Max 20x ($200/month) on quality-by-task-class.

Step 3 — Upgrade Claude Pro to Max 5x ($100). When the Task D workload outgrows Claude Pro's 30-40-minute Opus window, this is the next step. Now you have $160/month total ($60 Synthetic + $100 Claude Max 5x) and you're covering effectively everything except multi-day production-grade refactors.

Step 4 — Build the escalate-on-fail router. This is the optional power-user step. Wire your agent harness so the first attempt at any task goes to Kimi via Synthetic, and on failure (loop detected, output rejected by tests, confidence score below threshold, whatever signal works) the same prompt is re-tried on Claude. You get budget-model economics on the 70-80% of tasks that don't need a premium model, and you get the premium-model safety net for the rest, without having to pre-classify task bucket by hand.
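A minimal sketch of that router, assuming each backend can be wrapped as a plain function from prompt to output and that you supply your own acceptance check (tests pass, no loop detected, or whatever signal you trust). The names `run_with_escalation` and `Attempt` are hypothetical, and the stub lambdas stand in for real provider clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    model: str    # "budget" or "premium"
    output: str
    passed: bool  # whether the acceptance check approved the output


def run_with_escalation(
    prompt: str,
    budget_model: Callable[[str], str],
    premium_model: Callable[[str], str],
    accept: Callable[[str], bool],
) -> Attempt:
    """Try the budget model first; on failure, retry the same prompt on the premium model."""
    output = budget_model(prompt)
    if accept(output):
        return Attempt("budget", output, True)
    output = premium_model(prompt)
    return Attempt("premium", output, accept(output))


# Stub backends for demonstration only.
result = run_with_escalation(
    "rename foo to bar",
    budget_model=lambda p: "FAIL",
    premium_model=lambda p: "OK",
    accept=lambda out: out == "OK",
)
print(result)
```

The premium attempt is still run through `accept`, so a task that fails on both tiers surfaces as `passed=False` instead of being silently trusted.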

Step 5 — Self-host GLM-5 (only if you actually need it). This is the rare step. If you've worked through steps 1-4 and you're hitting one of the Task C constraints (regulatory, data residency, fine-tuning need, fixed-cost unit economics), GLM-5 self-hosted is the answer. Otherwise skip it.

Most working engineers should land at Step 2 or Step 3. Step 5 is for a specific kind of product, not a general developer workflow.

What I'd tell someone choosing today

The honest summary, with no hedging:

For agentic work in 2026, Kimi K2.5 via Synthetic.new at $60/month is the best open-weight value on the market and it isn't close. The PARL swarm is a real engineering advantage that GLM-5 doesn't replicate. The Synthetic.new subscription gives you more request headroom than Claude's $200 tier at less than a third of the price.

GLM-5 is the right model for one specific scenario — you genuinely need self-hosted weights — and it's the wrong model for almost everything else. The benchmark wins are real and the agentic reliability problems are also real. Don't let the headline numbers route you into a workflow GLM-5 can't sustain.

Claude Sonnet 4.6 is not in this comparison as a budget option. It's the escalation tier for the 15-25% of tasks where the failure cost makes the token cost irrelevant. If you're a working engineer billing your time at anything above $50/hour, the $200 Claude Max 20x subscription pays for itself within the first week of real use on Task D work.

Pick by task, not by model. The "which one is best?" framing is the wrong question, and the longer you stay in it, the more time you spend rerunning workflows on the model you should have routed somewhere else.

Written by Dmytro Chaban

AI engineer writing about agentic systems, MCP integration, and LLM comparisons. 10+ years building production software, 4+ focused on AI.
