Claude Opus 4.7 vs 4.6: What Changed, What Breaks, Should You Migrate (2026)

Anthropic shipped Claude Opus 4.7 on April 16, 2026, and the launch coverage is uniformly positive, which is exactly why a skeptical read is overdue. Here's what actually changed versus 4.6, what will break in your API calls tomorrow morning, and whether the migration is worth planning this week or next quarter.
Key Takeaways#
- SWE-bench Pro jumps +10.9 (53.4 → 64.3) and CharXiv vision jumps +13.0. Coding and vision are the real wins.
- BrowseComp agentic search regresses −4.4 pts and CyberGym is effectively flat. Not every task improved.
- Four breaking API changes will trip existing code on migration day: sampling params removed, extended thinking budgets gone, thinking hidden by default, new tokenizer.
- The new tokenizer maps the same input to 1.0 to 1.35× as many tokens. Budget impact is real, even though pricing is unchanged.
- Pricing stays at $5 / $25 per million tokens, 1M context at standard API pricing, no long-context premium.
- Claude Code default effort already flipped to `xhigh` across all plans, and there's a new `/ultrareview` slash command.
- Mythos Preview is still out of reach. Opus 4.7 is the strongest model you can ship against today, but Anthropic is openly telegraphing what's next.
TL;DR: Opus 4.7 vs 4.6 in 60 Seconds#
Claude Opus 4.7 is a direct upgrade to 4.6 with real gains in coding (SWE-bench Pro +10.9, Verified +6.8), vision (CharXiv +13.0, max image resolution up to 2,576px, roughly triple the pixel count), and knowledge work. Pricing and the 1M context window are unchanged. But four breaking API changes plus a new tokenizer mean you can't just swap the model ID. There's a mandatory migration audit before this goes to production.
| | Opus 4.7 | Opus 4.6 |
|---|---|---|
| Model ID | `claude-opus-4-7` | `claude-opus-4-6` |
| Pricing | $5 / $25 per M tokens | $5 / $25 per M tokens |
| Context window | 1M tokens | 1M tokens |
| Released | April 16, 2026 | February 2026 |
Bottom line: if you ship agentic coding or vision-heavy workloads, migrate this week. If you run BrowseComp-style search agents or can't audit your API calls right now, wait. Decision framework at the bottom.
Benchmarks: Where 4.7 Beats 4.6, Where It Doesn't#
Here's the honest read on the numbers. Coding and vision are up. Agentic search and cybersecurity are flat or negative. Anthropic's own chart includes all of this; the rest of the launch coverage just isn't calling it out.
| Benchmark | Opus 4.7 | Opus 4.6 | Δ | Mythos Preview |
|---|---|---|---|---|
| SWE-bench Pro (agentic coding) | 64.3% | 53.4% | +10.9 | 77.8% |
| SWE-bench Verified | 87.6% | 80.8% | +6.8 | 93.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | +4.0 | 82.0% |
| HLE (no tools) | 46.9% | 40.0% | +6.9 | 56.8% |
| HLE (with tools) | 54.7% | 53.3% | +1.4 | 64.7% |
| BrowseComp (agentic search) | 79.3% | 83.7% | −4.4 | 86.9% |
| MCP-Atlas (scaled tool use) | 77.3% | 75.8% | +1.5 | N/A |
| OSWorld-Verified (computer use) | 78.0% | 72.7% | +5.3 | 79.6% |
| Finance Agent v1.1 | 64.4% | 60.1% | +4.3 | N/A |
| CyberGym (vuln reproduction) | 73.1% | 73.8% | −0.7 | 83.1% |
| GPQA Diamond (grad-level reasoning) | 94.2% | 91.3% | +2.9 | 94.6% |
| CharXiv (visual, no tools) | 82.1% | 69.1% | +13.0 | 86.1% |
| CharXiv (visual, with tools) | 91.0% | 84.7% | +6.3 | 93.2% |
| MMMLU (multilingual Q&A) | 91.5% | 91.1% | +0.4 | N/A |
Coding: The Headline Upgrade#
SWE-bench Pro going from 53.4 to 64.3 is the single largest jump on any headline benchmark. That's not a rounding-error improvement. It's a reliability shift. Early-access testers are reporting the same thing from different angles: Cursor saw 70% on CursorBench versus 58% for 4.6, GitHub measured a 13% lift on their 93-task internal benchmark, and Rakuten reports resolving 3× more production tasks. The practitioner read: this is the reliability jump, not the raw-IQ jump. Tasks you previously couldn't hand off without babysitting are now worth retrying.
If you're already reading comparisons like our Claude Opus 4.6 vs Codex 5.3 showdown, the 4.7 numbers move the coding comparison decisively in Claude's favor on Pro.
Vision: The Sleeper Upgrade#
CharXiv no-tools jumping +13.0 is bigger than any coding delta, and almost nobody is talking about it. The ceiling for image resolution went from 1,568px (~1.15 megapixels) to 2,576px (~3.75 megapixels). That's 3.3× the pixel count. Coordinates now map 1:1 with actual pixels, which eliminates the scale-factor math that computer-use agents have been doing by hand. XBOW reported 98.5% on their visual-acuity benchmark versus 54.5% for 4.6. That's the extreme case, not the median, but the direction is clear: computer-use agents, dense screenshot parsing, and anything doing patent or diagram extraction are the obvious winners.
Where 4.7 Regresses#
BrowseComp dropped 4.4 points. On the benchmark Anthropic uses to measure agentic web search, 4.7 is meaningfully worse than 4.6. If your production workflow is search-heavy (browsing, research, information gathering across sites), that workload may be a legitimate reason to stay on 4.6 for that specific agent.
CyberGym is down 0.7 points. Effectively flat, but notable because it's consistent with the new real-time cybersecurity safeguards: 4.7 is trading some raw capability for refusal-on-high-risk. Multilingual Q&A (MMMLU) is up 0.4 points. Call it flat.
None of the other 2026-04-16 launch coverage flags these regressions. That's the gap you should notice.
The Mythos Preview Context#
Every row on Anthropic's chart shows Mythos Preview above 4.7. SWE-Pro 77.8 versus 64.3. SWE-Verified 93.9 versus 87.6. CharXiv 86.1 versus 82.1. Anthropic is openly telegraphing the next model and holding it back behind Project Glasswing safety testing. The strategic read: don't over-invest in 4.7-specific optimization if your roadmap runs quarterly. Opus 4.7 is a release floor for Mythos-class deployment, not a frontier destination.
Pricing and Limits: Unchanged, and That's the Real Story#
$5 per million input tokens. $25 per million output tokens. Identical to 4.6. The 1M context window is at standard API pricing, no long-context premium. Max output stays at 128k tokens.
That price parity is underrated. Anthropic's internal coding eval shows token efficiency improved at each effort level even before accounting for the new tokenizer. That means same price, more output per dollar on coding workloads. For a live comparison against every other frontier model, see the LLMx pricing tracker.
The Breaking API Changes That Will Bite You#
This is the part news sites glossed over. Four things change defaults or return 400 errors the moment you switch the model ID. Here's the practitioner-to-practitioner map of what to audit before migrating.
1. Extended Thinking Budgets Are Gone#
Sending `thinking: {type: "enabled", budget_tokens: N}` now returns a 400 error on 4.7. Adaptive thinking is the only thinking-on mode, and it's off by default. You have to set `thinking: {type: "adaptive"}` explicitly. If your client code assumed thinking was always on for Opus, you'll silently lose chain-of-thought on migration day until you update the request shape.
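As a migration sketch, the payload rewrite is a one-field swap. The field names below mirror the request shapes described above; treat the exact schema as an assumption, not a verified SDK contract:

```python
def migrate_thinking(request: dict) -> dict:
    """Rewrite a 4.6-era request body for 4.7's thinking rules.

    Sketch only: field names follow the shapes described above,
    not a verified SDK schema.
    """
    migrated = dict(request)  # don't mutate the caller's payload
    thinking = migrated.get("thinking") or {}
    if thinking.get("type") == "enabled":
        # budget_tokens now 400s; adaptive is the only thinking-on mode
        migrated["thinking"] = {"type": "adaptive"}
    return migrated

legacy = {
    "model": "claude-opus-4-6",
    "thinking": {"type": "enabled", "budget_tokens": 8192},
}
migrated = migrate_thinking(legacy)
# migrated["thinking"] is {"type": "adaptive"}; budget_tokens is gone
```

Running something like this over captured request payloads before cutover is cheaper than finding the 400s in production logs.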
2. Temperature, top_p, top_k Return 400#
Non-default values for `temperature`, `top_p`, or `top_k` now fail outright. The safe migration is to remove them from requests entirely and control behavior through prompting. If you were using `temperature=0` for determinism, that was always a myth, so the removal isn't a real loss. But audit your code for hardcoded sampling parameters before you flip the model ID in production.
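The audit can be as simple as stripping the rejected parameters and reporting what was there, so you can grep for call sites that still set them. A minimal sketch; the three parameter names are exactly the ones listed above:

```python
REJECTED_ON_47 = ("temperature", "top_p", "top_k")

def strip_sampling_params(request: dict) -> tuple[dict, list[str]]:
    """Drop the sampling params 4.7 rejects and report what was removed."""
    dropped = [k for k in REJECTED_ON_47 if k in request]
    cleaned = {k: v for k, v in request.items() if k not in REJECTED_ON_47}
    return cleaned, dropped

cleaned, dropped = strip_sampling_params(
    {"model": "claude-opus-4-7", "temperature": 0.0, "max_tokens": 4096}
)
# dropped == ["temperature"]; cleaned no longer carries it
```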
3. Thinking Content Hidden by Default#
Thinking blocks still appear in the response stream, but the `thinking` field is now empty unless the caller opts in. If your product shows reasoning to users, it'll look like a long pre-output pause. Opt back in with `display: "summarized"` in the thinking config. Streaming UIs should plan for this specifically: it's a silent behavior change with no error raised.
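Opting back in is one field in the thinking config. The shape below is a hypothetical sketch built from the `display` option named above, not a verified schema:

```python
# Silent change: forgetting this raises no error, the thinking field
# just streams back empty. Field names are a sketch, not verified.
request = {
    "model": "claude-opus-4-7",
    "thinking": {"type": "adaptive", "display": "summarized"},
}
```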
4. New Tokenizer: Budget for 1.0 to 1.35× the Tokens#
Same input text now maps to up to 35% more tokens on 4.7, and the `/v1/messages/count_tokens` endpoint returns different numbers than it did for 4.6. Practical cost math: a workload that cost you $500 in tokens on 4.6 could land near $675 on 4.7 in the worst case, before any response-length changes. Anthropic's own testing shows net-favorable costs because 4.7 thinks more efficiently at each effort level, but measure on real traffic before assuming that applies to your workload. Bump `max_tokens` headroom and compaction triggers.
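Back-of-envelope budgeting under the unchanged $5 / $25 pricing. The traffic mix below is a made-up example; only the prices and the 1.0 to 1.35× range come from the material above:

```python
PRICE_IN, PRICE_OUT = 5.00, 25.00  # USD per million tokens, unchanged in 4.7

def monthly_cost(in_tok_m: float, out_tok_m: float, inflation: float = 1.0) -> float:
    """Spend in USD. `inflation` models the 1.0-1.35x input-token increase
    from the new tokenizer; output-length shifts are workload-specific,
    so they are deliberately left out here."""
    return in_tok_m * inflation * PRICE_IN + out_tok_m * PRICE_OUT

baseline = monthly_cost(80, 4)        # 4.6: 80M input + 4M output -> $500.00
worst    = monthly_cost(80, 4, 1.35)  # same traffic on 4.7, worst case -> $640.00
```

Note the worst case lands below a naive 1.35× of the whole bill, because output pricing dominates less than you'd think only when input volume is high; run your own mix through it.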
xhigh Effort Level and Adaptive Thinking#
The new `xhigh` effort level sits between `high` and `max`. Anthropic's guidance: start with `high` or `xhigh` for coding and agentic use cases. Claude Code already flipped the default to `xhigh` across all plans. If you use Claude Code, you're probably already on it without knowing. The Hex team reported that low-effort 4.7 is roughly equivalent to medium-effort 4.6, which is useful for cost control. For a deeper walkthrough on tuning effort per subscription tier, see how to change Claude Code effort level.
High-Resolution Vision: Who Actually Benefits#
The 2,576-pixel ceiling and 1:1 pixel-coordinate mapping open up a specific set of workflows. Computer-use agents reading dense screenshots. Chart and figure transcription where pixel-level detail matters. Life-sciences diagram parsing (Anthropic cites Solve Intelligence for patent workflows). Legal document extraction. If your product doesn't touch any of these, the vision upgrade is nice-to-have, not load-bearing. Caveat: high-resolution images consume proportionally more tokens, so downsample inputs where you don't need the extra fidelity.
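Since high-resolution images cost proportionally more tokens, a pre-upload size check is an easy saving where fidelity doesn't matter. In this sketch the 1,568px target is simply 4.6's old ceiling, and treating token cost as purely pixel-proportional is an assumption:

```python
OLD_CEILING_PX = 1568   # 4.6's longest-edge limit
NEW_CEILING_PX = 2576   # 4.7's longest-edge limit

def downscale_factor(width: int, height: int, target: int = OLD_CEILING_PX) -> float:
    """Factor to resize an image back to the 4.6-era ceiling when you
    don't need the extra fidelity. Returns 1.0 if it's already small enough."""
    longest = max(width, height)
    return min(1.0, target / longest)

downscale_factor(3200, 1800)  # 0.49 -> resize before upload
```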
Task Budgets (Beta): When to Use, When Not To#
Task budgets are an advisory token allowance for the entire agentic loop: thinking, tool calls, tool results, and final output combined. The model sees a running countdown and self-moderates. Activate via the `task-budgets-2026-03-13` beta header. Minimum budget is 20,000 tokens.
The important distinction: `max_tokens` is a hard per-request cap the model doesn't see; `task_budget` is a soft target the model uses to prioritize. Use it for fixed-cost agent workloads and production cost guardrails. Skip it for open-ended quality-first work where you want the model to use whatever it needs.
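A request sketch under the beta. Only the header value and the 20,000-token minimum come from the material above; the `task_budget` field name and the overall shape are my assumption:

```python
# Hypothetical request shape; only the beta header value and the
# 20,000-token minimum are documented above.
MIN_TASK_BUDGET = 20_000

request = {
    "model": "claude-opus-4-7",
    "max_tokens": 16_000,   # hard per-request cap; the model never sees it
    "task_budget": 50_000,  # soft whole-loop target the model prioritizes against
    "extra_headers": {"anthropic-beta": "task-budgets-2026-03-13"},
}
assert request["task_budget"] >= MIN_TASK_BUDGET  # below this, the beta rejects it
```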
Behavior Changes You'll Notice in Production#
These aren't breaking API changes, but prompts written for 4.6 may produce unexpected results because 4.7 is different in production.
- More literal instruction following. The model won't silently generalize from one item to another, and it won't infer requests you didn't make. Lower effort levels are especially literal.
- Response length calibrates to task complexity, not a fixed verbosity default.
- Fewer tool calls by default, more internal reasoning. Raise effort to increase tool usage.
- Fewer subagents spawned by default. Steerable through prompting.
- More direct, opinionated tone. Less validation-forward language, fewer emoji than 4.6's warmer style.
- More regular progress updates during long agentic traces. Remove scaffolding you added to force interim status messages.
- Real-time cybersecurity safeguards. Prohibited or high-risk security queries may now get refused. Legitimate work goes through Anthropic's Cyber Verification Program.
If you're managing prompts across model upgrades at scale, our take on prompt versioning for agentic systems becomes more important with 4.7 than it was with 4.6. The literal-instruction shift is the big one.
Knowledge Work and Memory Gains#
The .docx redlining and .pptx editing workflows got real improvements. Better self-checking on tracked changes, better slide layouts. Chart and figure analysis improved through better programmatic tool-calling with image libraries like PIL. Pixel-level data transcription from charts is meaningfully more reliable. File-system-based memory (scratchpad files, structured notes across sessions) is a tighter story on 4.7. If you've been writing scaffolding like "double-check the slide layout before returning," remove it and re-baseline.
Also Launching With 4.7#
Three adjacent launches worth knowing about but not the main story: a new `/ultrareview` slash command in Claude Code for dedicated review sessions (Pro and Max users get three free), auto mode extended to Max users for long-running tasks with fewer interruptions, and Anthropic's Figma/Word/PowerPoint/Excel integrations continuing to ship. If you're deciding between subscription tiers, our best-value LLM subscriptions guide has the math on whether Claude Max makes sense now that `/ultrareview` is in the mix.
Decision Framework: Should You Migrate Today?#
Migrate Now If…#
- Your workload is coding-heavy, especially long-running or agentic.
- You use Claude Code. The default effort already flipped to `xhigh` and `/ultrareview` is live on launch day.
- You have computer-use agents reading screenshots or parsing dense UIs.
- You process complex documents, charts, or technical diagrams.
- Your prompts already use adaptive thinking (or you've been meaning to switch).
- You can spare two hours to audit the four breaking API changes before production cutover.
Wait If…#
- Your workload depends on BrowseComp-style agentic search. The regression is real.
- You have `temperature`, `top_p`, or `top_k` hardcoded in critical paths and can't audit this week.
- Your production budget can't absorb a 1.0 to 1.35× tokenizer inflation without a review cycle.
- You do high-volume cybersecurity work that isn't approved under the Cyber Verification Program.
- Mythos-class capabilities are on your next-quarter roadmap. Opus 4.7 is a waypoint, not a destination. Don't build 4.7-specific optimizations if you're going to redo them for Mythos.
For a broader framework on picking the right model for your workload, see how to identify the best model for your work. And for a real-agent pipeline comparison across models in this generation, our Gemini 3.1 Pro vs Claude Sonnet 4.6 vs Opus 4.6 pipeline test is the closest precedent for how these numbers translate to production.