Best Frontier Coding LLM in 2026: A Real Agent-Pipeline Test of Claude Opus 4.7, GPT-5.3-Codex, and Gemini 3.1 Pro

We were halfway through a ticket-to-PR pipeline on Gemini 3.1 Pro when the model quietly switched jobs.
Step 1 had gone clean: it read the Jira ticket, pulled the requirements, identified the acceptance criteria. Step 2 was a Confluence dump — about 8,000 tokens of internal specs, API references, the usual archaeology. Buried somewhere in that pile was a single line, the kind of line that lives in every internal doc on earth: "we need to update the documentation."
Step 3 was supposed to be a web search. It wasn't. Gemini decided that sentence was its new task. It dropped the ticket, skipped the search, skipped the log debugging, and started writing documentation updates that absolutely nobody had asked for.
The pipeline didn't crash. It didn't throw an error. It just stopped doing the thing we wanted and started doing a different thing. That's not a benchmark failure. That's a production incident with the lights still green.
I've spent the last three months running every frontier coding model through some version of this pipeline — Claude Sonnet 4.6, Claude Opus 4.6, the new Opus 4.7, GPT-5.3-Codex, Gemini 3.1 Pro. Single-prompt benchmarks tell you almost nothing useful at this point. What matters is which of these models can survive a five-step agentic workflow without going off the rails. This post is the answer, told as a story rather than a spec sheet, because the story is where the real information lives.
The 2026 question isn't "which model is smartest"#
Every frontier model has frontier-grade single-shot intelligence now. Gemini 3.1 Pro hits 94.3% on GPQA Diamond. Claude Opus 4.7 lands at 64.3% on SWE-bench Pro, a ten-point jump from 4.6's 53.4%. GPT-5.3 has its own headline numbers. If you only ever pass these models a single prompt, you can pick any of them and ship.
But almost nobody runs production this way anymore. The work has migrated into agents — multi-step pipelines that read tickets, pull docs, search the web, run tests, edit files, open PRs. The question stopped being "which model writes the best single function?" and became "which model can hold a five-step plan in its head for two hours without forgetting why it started?"
That's a completely different competition. And the winners aren't always the models with the highest IQ.
The test: a real 5-step ticket-to-PR pipeline#
I'm not running synthetic benchmarks. I'm running the actual workflow my team uses to ship features. Five sequential steps, each one a real tool call against real systems:
- Jira read. Parse the ticket, pull related tickets and project context.
- Confluence read. Fetch internal docs, API specs, architecture notes.
- Web search. Fill remaining context gaps with external docs, changelogs, known issues.
- Log debugging. Analyze stack traces from the production environment.
- PR creation. Write the code fix and open a pull request with correct scope.
If a human did this end-to-end, you'd lose two or three hours of focused engineering time. Done by a competent agent, it's the difference between shipping the same day and shipping next week. Done by an incompetent agent, it's worse than not running the agent at all — because now somebody has to figure out what the model thought it was doing and clean it up.
Each model got the same prompt, the same tools, the same Jira ticket, the same Confluence corpus. The only thing that changed was the model behind the wheel.
Round 1: Gemini 3.1 Pro — derailed at step three#
Gemini handled steps 1 and 2 cleanly. The Jira parse was clean. The Confluence summary was good. The model demonstrated real reasoning chops — the kind of work that earned Gemini that 94.3% on GPQA Diamond and the 77.1% on ARC-AGI-2 it shipped with at launch.
Then step 3 didn't run.
The model read "we need to update the documentation" inside the retrieved Confluence content and treated it as a higher-priority instruction than the original system prompt. It re-anchored to the most recent large chunk of context. The ticket evaporated. The PR never got opened. The agent ran to completion and produced exactly the wrong output, confidently.
This isn't a one-off. Google's own agentic benchmarks show the pattern. Apex Agents improved from 18.4% to 33.5% over Gemini 3.0 — real progress, but still a long way from production. MCP Atlas multi-step lands at 69.2%, which means roughly one in three multi-step runs fails. In a pipeline where failure means a wrong-scope PR or unsanctioned documentation rewrites, that's not a number you can engineer around.
Three specific failure modes show up across multiple runs:
Instruction hierarchy drift. When Gemini 3.1 Pro processes a large tool result (8,000+ tokens), the original system-level task instructions lose attention weight relative to the new content. The model effectively re-anchors to the most recent large input — especially when that input contains language that reads like instructions.
Tool results treated as directives, not data. A well-scoped model reads tool output as context. Gemini can read it as a new task. Seven words in a spec file were enough to reprioritize everything. Not a hallucination. Not a bug. The original system prompt losing to whatever arrived most recently in context.
No compaction or adaptive-thinking equivalent. Claude has server-side context compaction. Gemini doesn't. As tool chains grow, early instructions progressively dilute. There's no infrastructure layer keeping the original frame anchored.
This is a reliability gap, not a capability gap. Gemini 3.1 Pro is plenty capable. Capable and disciplined just aren't the same thing in a tool chain.
Round 2: Claude Sonnet 4.6 — five clean steps#
I ran the identical pipeline with Claude Sonnet 4.6. All five steps, no intervention. Jira read, Confluence pulled, web search ran, stack traces analyzed, PR opened with correct scope. After each tool call it returned to the original task instructions before deciding the next step.
Sonnet 4.6 treats tool results as inputs, not directives. That's the entire difference.
On paper, Sonnet 4.6 doesn't look like the obvious winner. Gemini beats it on GPQA Diamond and ARC-AGI-2. But on GDPval-AA — the Artificial Analysis benchmark that measures expert-task consistency across knowledge work — Sonnet scores 1,633 vs Gemini's 1,317. A 316-point Elo gap. That's not a marginal difference; that's a different reliability tier. Anthropic reports that Replit measured a 0% error rate on their internal code editing benchmark with Sonnet 4.6 (vendor-reported, take with the standard grain of salt). What I can confirm from my own runs: the model doesn't break mid-task.
Sonnet 4.6 is the production workhorse. If you only have budget for one default model behind your agent stack, this is the one.
Round 3: Claude Opus 4.6 — and a behavior nobody benchmarks#
Opus 4.6 ran the same pipeline. All five steps. Correctly-scoped PR on the first run. Every time.
But the interesting thing happened before the pipeline started.
When I gave Opus a slightly underspecified version of the ticket — vague acceptance criteria, ambiguous about which service to touch — it paused and asked. Not filler questions. The questions that actually shape what gets built: Is the business planning to scale soon, or is MVP fine with known limits? Velocity or cost? New service boundary or extend what's there? The answers determined the design. Opus surfaced them upfront.
Gemini 3.1 Pro under the same conditions just starts. It interprets, assumes, branches in multiple directions, and second-guesses mid-task. In a single-shot task, fine. In an agent taking real actions, that's how you end up with work done on the wrong scope.
There's also a piece of the UX nobody writes about. When Gemini asks a clarifying question, it's a wall of text — you read it, parse what it actually wants, type a response. When Opus uses Claude's structured ask_question tool, it pops a dialogue with choices you click. Same question, a third of the friction. Across a complex session with five or six decision points, that compounds.
Opus 4.6 also has the things that matter at scale: 1M context window at standard pricing (no long-context premium since March 13, 2026), server-side context compaction, adaptive thinking. It hits 65.4% on Terminal-Bench 2.0 — at the time, the highest of any production model — and 80.8% on SWE-bench Verified. On τ2-bench (agentic tool use), the DeepMind model card shows it essentially tied with Gemini 3.1 Pro: 99.3% on Telecom for both, Opus 91.9% vs Gemini 90.8% on Retail.
This isn't the model you use for everything. This is the model you reach for when the pipeline can't fail.
Round 4: GPT-5.3-Codex — fast, polished, context-rots#
Both OpenAI and Anthropic launched on February 5, 2026 — same day, deliberate broadside. Codex 5.3 is real. It's about 25% faster than 5.2 (OpenAI's own number), it leads SWE-bench Pro at launch, and it produces noticeably more polished web-dev output than either Claude or Gemini if you ask it for a landing page or a UI component. If your workload is rapid front-end iteration, Codex has a real edge.
But put it in the same five-step pipeline and the cracks show.
I ran a four-hour refactor on Opus 4.5 last year — about 200K tokens of context, autonomous from start to finish. It maintained the thread, tracked its own progress, produced working code. Codex started strong on the equivalent task and drifted after about 90 minutes — repeating work, forgetting earlier decisions, eventually requiring a manual reset. After roughly 100K tokens in a single session, output quality drops noticeably. The model is fast and it's smart, but it can't hold a long agentic workflow together the way Opus can.
Two other things you notice once you're running Codex at scale:
It over-engineers. In my comparison tests, Codex solutions averaged about 30% more lines than Opus for equivalent functionality. More lines means more bugs, more review burden, more maintenance debt. This isn't a benchmark — it's just what I saw across dozens of side-by-side runs.
It's locked into OpenAI's ecosystem. Codex works beautifully if you're already inside the ChatGPT / OpenAI tooling story. If you want fine-grained control — adaptive thinking, effort levels, context compaction, custom tool harnesses — the Anthropic API gives you handles Codex doesn't expose.
OpenAI reports Codex hits roughly 64.7% on Terminal-Bench 2.0 against Opus 4.6's 65.4% — a statistical tie. On the harder benchmarks Opus leads consistently. Opus 4.6 hits 68.8% on ARC-AGI-2; Codex doesn't publicly report a comparable number. BrowseComp goes to Opus at 84.0%; Codex doesn't engage that benchmark either. The honest read: SWE-bench Pro is the one place Codex clearly leads at launch, and that lead got erased in April when Opus 4.7 dropped (more on that in a minute).
Real agent tests, not press-release benchmarks
Production pipeline results, breaking-API audits, and honest model comparisons when new frontier models ship.
The benchmark sidebar — and the one number you can't take at face value#
I've been weaving benchmarks into the story rather than dropping a giant table on you, because the table is what every other comparison post does and it tells you almost nothing. But three numbers are worth pulling out specifically.
SWE-bench Pro (multi-language, contamination-resistant). Opus 4.6: 53.4%. Opus 4.7: 64.3% — an unusually large single-version jump. Codex 5.3 led SWE-bench Pro at launch in February; the Opus 4.7 release in April flipped that.
Terminal-Bench 2.0 (agentic terminal coding). Opus 4.6 at 65.4%, Gemini 3.1 Pro at 68.5% on the same Terminus-2 harness (DeepMind model card). Opus 4.7 moves to 69.4%. Credit where it's due: Gemini was briefly the leader on this specific benchmark, and Opus 4.7 just took it back.
MRCR v2 at 1M tokens. This is the number to look at sideways. The benchmark hides 8 needles across the full context window and tests whether the model can find and reason about all of them. At 128K tokens, Gemini 3.1 Pro and Opus 4.6 are basically tied — Gemini 84.9% vs Claude 84.0% on the DeepMind card. At 1M tokens, Google's own card shows Gemini collapsing to 26.3%. Anthropic claims Claude holds at 76% at 1M, but this number has not been independently verified at 1M on a shared harness. The direction of the gap is real and consistent across what I see in production. The exact magnitude — 76% vs 26.3% — should be treated as the headline rather than the literal truth, because no neutral third party has reproduced both at 1M on the same setup.
Interestingly, Gemini 3.1 Flash-Lite — Google's cheap tier — scores 60.1% on MRCR v2 at 1M, outperforming the much more expensive Gemini 3.1 Pro. If you specifically need cheap retrieval at 1M, the Flash-Lite line is a better Gemini choice than Pro.
The takeaway isn't "Claude wins, Gemini loses." It's that benchmark wins at 128K don't generalize to 1M, and the model you pick for a 50K-token chat is not necessarily the model you should pick for a 600K-token agent.
What Opus 4.7 changes inside the same pipeline#
Anthropic shipped Opus 4.7 on April 16, 2026. Same pricing as 4.6 ($5 input / $25 output per million tokens), same 1M context window at standard pricing. But the model is different enough that I had to re-run the pipeline.
The coding gains are real and large. SWE-bench Pro went from 53.4 to 64.3 (+10.9). SWE-bench Verified from 80.8 to 87.6 (+6.8). Terminal-Bench 2.0 from 65.4 to 69.4 (+4.0). For the kind of agentic coding work this pillar is about, those are the relevant numbers and they all moved in the right direction. Anthropic also points to early-tester reports from Cursor, GitHub, and Rakuten that suggest similar lifts in production, though those vendor-attributed numbers don't have independent announcement URLs I can link, so treat them as directional rather than definitive.
The sleeper upgrade is vision. CharXiv (no tools) jumped from 69.1% to 82.1% — a +13.0 swing, the biggest delta on any benchmark in the launch. The max image resolution tripled to 2,576px, and coordinates now map 1:1 with actual pixels, which eliminates a whole class of scale-factor math computer-use agents had been doing by hand. If you run screenshot-driven agents, diagram extraction, or dense-UI parsing, this is the real story of 4.7.
The one thing that regressed: BrowseComp (agentic web search) dropped from 83.7% to 79.3%, −4.4 points. If your production workflow is search-heavy — browsing, research, multi-site information gathering — that workload is a legitimate reason to stay on 4.6 for that specific agent.
Four breaking API changes will bite you on migration day. I'll keep this short because the details are tedious but the headline matters:
- Extended thinking budgets are gone. Sending
thinking: {type: "enabled", budget_tokens: N}returns 400. Adaptive thinking is now the only mode, and it's off by default. - Non-default
temperature,top_p,top_kreturn 400. Strip them from your requests and control behavior through prompting. - Thinking content is hidden by default. If your product displays reasoning, opt back in with
display: "summarized"or it'll look like the model is paused. - New tokenizer uses 1.0–1.35× more tokens. Same prompt, more tokens, same per-token price. Budget headroom needs to grow.
There's also a new xhigh effort level sitting between high and max. Claude Code already flipped its default to xhigh across all plans. If you use Claude Code, you're probably already on Opus 4.7 with xhigh defaults without realizing it. The Hex team reportedly found that low-effort 4.7 is roughly equivalent to medium-effort 4.6 — useful if cost control matters more than peak capability.
The migration calculus: if you're doing agentic coding, migrate this week. If your workload is BrowseComp-heavy, wait. If you can't afford two hours to audit the four breaking changes, wait. Anthropic's own chart also shows their unreleased Mythos Preview above 4.7 on virtually every benchmark — SWE-Pro 77.8 vs 64.3, SWE-Verified 93.9 vs 87.6, CharXiv 86.1 vs 82.1. Don't over-invest in 4.7-specific optimizations if your roadmap runs quarterly. 4.7 is a release floor for Mythos-class work, not a frontier destination.
The 1M context window question: who actually uses it inside the window#
Every frontier model now claims 1M tokens. Gemini was first. Anthropic caught up in March 2026 with no long-context premium. OpenAI has 400K on Codex 5.3, which sounds smaller but is still generous.
The interesting question isn't who has the biggest window. It's what the model does inside it.
For Claude, "what it does inside" includes context compaction. As the conversation approaches 1M tokens, the system automatically summarizes earlier content to make room for new information. You don't hit a wall and start over. The model gracefully degrades older context while preserving the most relevant pieces. Gemini doesn't have this. When you hit the limit on Gemini, you hit the limit.
For Gemini, "what it does inside" includes that MRCR v2 collapse I mentioned. The retrieval quality at 1M just isn't there yet, regardless of what the window size claims.
There's also a practical warning that nobody else seems to be writing about: 1M context burns through your Claude Max subscription limits fast. Even on the $200/month Max 20x plan, a single long Opus session at high context can eat a significant chunk of your weekly allowance. You're sending 500K–900K tokens per request instead of 50K–100K. That's 10x the consumption per interaction. I've hit the weekly limit mid-week more than once since switching to 1M context. Save the full window for tasks that genuinely need it — large refactors, cross-file debugging, architecture decisions across a whole service. Don't load your entire codebase to fix a one-file typo.
If you want a more thorough framework for matching model to task, I wrote one separately in how to identify the best model for your work. The TL;DR is: leaderboards narrow your candidate set, real workload A/B testing picks the winner.
Pricing reality: effective cost per completed task, not per million tokens#
This is the part where most comparison posts lose the plot. They line up the per-token prices, declare a winner, and move on. The actual question is what it costs you to finish a task, including the retries when the cheaper model gets it wrong.
The headline numbers:
- Claude Opus 4.7 / 4.6: $5 input / $25 output per 1M tokens. Flat at 1M. Cached input $0.50/1M.
- Claude Sonnet 4.6: $3 / $15 per 1M (≤200K), $6 / $22.50 above 200K. Cached input $0.30/1M.
- Gemini 3.1 Pro: $2 / $12 per 1M for prompts under 200K, $4 / $18 above 200K. Cached input $0.20/1M under 200K, $0.40/1M above.
- GPT-5.3-Codex: premium tier comparable to Opus, with the speed advantage baked in.
A full 1M-context Opus request costs $5.00 standard, $0.50 cached. A full 1M-context Gemini Pro request lands at about $3.60 with the tiered pricing kicking in above 200K. That's roughly 28% cheaper, not the 2.5x cheaper the headline pricing suggests.
Now factor in completion rate. If an Opus agent finishes the task in one run while a cheaper model needs a retry plus a human cleaning up after a wrong-scope PR, the effective cost flips entirely. Engineering time is dramatically more expensive than tokens. A $5 task that ships beats a $1 task that needs a $200 cleanup.
For the full per-token math across every tier (including the cheap end where Gemini Flash-Lite is genuinely interesting), see the LLMx pricing tracker. For the subscription side of the same question — Claude Max 20x at $200, Max 5x at $100, the trap that is Claude Pro at $20 — I wrote a full tier-by-tier guide in best value LLM subscriptions in 2026.
Decision matrix: by workload type, with explicit "wait" cases#
After all that, here's what I actually do. This is the routing logic, not the leaderboard:
Long-running agentic coding pipelines (anything where the model has to hold context across multiple tool calls and self-correct):
- Default: Claude Opus 4.7 (or 4.6 if you can't afford the migration audit this week).
- Cost-conscious alternative: Claude Sonnet 4.6 — completes the same pipeline reliably, at 60% of the price.
- Avoid: Gemini 3.1 Pro. The instruction-hierarchy drift kills you in production.
Single-shot code generation, well-defined contained problems:
- Default: Whichever frontier model you're already paying for. They're all good at this.
- Polish edge: Codex 5.3 if it's front-end / web-dev work.
- Cost edge: Gemini 3.1 Pro at $2/$12 below 200K.
Long-context document analysis (single-turn, no tool chain):
- Default: Gemini 3.1 Pro for multimodal or budget cases (the tiered pricing still works in your favor below 200K).
- Premium: Claude Opus 4.7 if you need the reasoning quality to actually use what's in the window.
- Cheap-and-cheerful for 1M retrieval: Gemini Flash-Lite — outperforms Gemini Pro on MRCR v2 at 1M, costs a quarter per 1M-token request.
Multimodal (video, audio, image-heavy):
- Default: Gemini 3.1 Pro. Claude can't process video natively.
- Exception: Pure screenshot / dense-UI parsing on Opus 4.7, which got a real vision upgrade with the 2,576px ceiling.
Vision-heavy agentic computer use (screenshots, dense UIs, diagram extraction):
- Default: Claude Opus 4.7. The +13.0 CharXiv jump and the 1:1 pixel coordinate mapping make this the meaningful upgrade in this release.
BrowseComp-style agentic web search:
- Default: Stay on Opus 4.6. 4.7 regressed 4.4 points on this specific benchmark. Migrate when 4.8 lands or earlier if Mythos ships.
When to explicitly wait before migrating or switching models:
- You can't afford the two-hour audit for the four breaking Opus 4.7 API changes this week.
- Your production budget can't absorb a 1.0–1.35× tokenizer inflation without a review cycle.
- Mythos-class capabilities are on your next-quarter roadmap. 4.7 is a waypoint, not a destination. Don't build 4.7-specific optimization scaffolding you'll just have to redo.
- You haven't actually run the agent through a real five-step pipeline yet. Leaderboard rankings will lie to you about how a model handles your specific workflow. Test it on your own work for at least a week before committing.
For the standalone-tool side of the Claude story — the new Claude Design prototyping canvas that depends on Opus 4.7 underneath — that's the most interesting non-coding launch this quarter and worth knowing about if you're already in the Anthropic ecosystem.
Frequently Asked Questions

Written by
AI engineer writing about agentic systems, MCP integration, and LLM comparisons. 10+ years building production software, 4+ focused on AI.
About Dmytro →Enjoyed this post?
Find out which LLM is cheapest for your use case — I test new models as they launch
No spam, unsubscribe anytime.


