Anthropic has a new workhorse, and this one is clearly meant to blur the line between “everyday model” and “frontier model.” Claude Sonnet 4.6, announced this week, is billed as the company’s most capable Sonnet yet — and in practice, it looks like Anthropic has pushed a mid‑tier model right up against its flagship Opus line on intelligence, while keeping Sonnet‑class pricing.
At a high level, Sonnet 4.6 is a full‑stack upgrade: coding, long‑context reasoning, computer use, agentic planning, document understanding, and UI design all get a step change rather than a minor revision. The model now offers a 1 million token context window in beta — enough to hold an entire sizeable codebase, a trove of contracts, or dozens of research papers in a single go — and Anthropic says it actually reasons competently across that whole span instead of treating it as a dumping ground. Pricing is unchanged from Sonnet 4.5 at $3 per million input tokens and $15 per million output tokens on the API, and Sonnet 4.6 has quietly become the default for both free and paid users in the Claude app and tools like Claude Cowork.
What’s most striking is where Sonnet 4.6 is positioned. On Anthropic’s own benchmarks and partner evals, the model routinely shows “Opus‑like” performance on the kinds of economically valuable tasks that used to require the company’s top‑shelf model: office workflows, complex coding, and long‑horizon, agentic decision‑making. In internal and partner tests, early users even preferred Sonnet 4.6 over Anthropic’s previous frontier model, Opus 4.5, most of the time, citing fewer hallucinations, less “overengineering,” and better instruction following.
One of the headline stories is computer use — Anthropic’s term for models that operate a real desktop environment via a virtual mouse and keyboard, rather than talking to neat APIs behind the scenes. The company was early here, debuting a general‑purpose computer‑using model back in October 2024, and it admitted at the time that the experience was “still experimental” and often clumsy. Fast‑forward sixteen months, and the OSWorld‑Verified benchmark — a kind of obstacle course where models must navigate Chrome, LibreOffice, VS Code, and other real apps — shows Sonnet climbing steadily with each release. Sonnet 4.6 now lands around the low‑70s on OSWorld‑Verified, up from the low‑60s with 4.5, a jump that external observers describe as the difference between “promising demo” and “actually useful on day‑to‑day computer tasks.”

That progress is visible in early customer reports. Anthropic says users are seeing near human‑level performance on tasks like navigating complex spreadsheets, filling out nested web forms, and juggling multi‑tab workflows — the unglamorous operational chores that companies have historically thrown people at rather than automation. Insurance startup Pace, which tests models on painstaking workflows like submission intake and first notice of loss, reports a 94% accuracy score and calls Sonnet 4.6 “the highest‑performing model we’ve tested for computer use.” Other partners, from Convey to enterprise vendors like Box, echo the same theme: the model is simply more accurate, more persistent, and more reliable when driving a UI.
There’s a flip side to super‑charged computer use: prompt injection and model hijacking become much more serious when a model can click around in your internal systems. Anthropic says it has leaned into hardening here. The Sonnet 4.6 system card describes extensive safety evaluations and concludes that the model has “a broadly warm, honest, prosocial, and at times funny character,” with strong safety behaviors and no major signs of misalignment on high‑stakes tasks. On prompt‑injection tests — where malicious instructions are hidden in web pages or documents — Sonnet 4.6 is described as a “major improvement” over Sonnet 4.5 and roughly on par with Opus 4.6. For organizations contemplating semi‑autonomous agents that can roam their intranets and back‑office tools, that’s not a nice‑to‑have — it’s close to table stakes.
The raw benchmark story for Sonnet 4.6 is, frankly, aggressive. On Artificial Analysis’s GDPval‑AA, a suite aimed at real‑world knowledge work, Sonnet 4.6 edges out even Opus 4.6, effectively taking the lead for agentic office tasks in that evaluation. On SWE‑bench Verified, a standard for agentic code fixing, Sonnet 4.6 jumps toward the high‑70s, nudging past 4.5’s scores and closing in on the best publicly reported models. OSWorld‑Verified computer‑use scores, as noted, show a double‑digit gain over Sonnet 4.5.
Reasoning and long‑horizon planning also see a notable leap. On ARC‑AGI‑2, one of the tougher abstract reasoning evals, Sonnet 4.6 vaults from a low‑teens score with earlier Sonnet versions to around 58–60% with high effort — the sort of gain that suggests architectural and training changes, not just another round of fine‑tuning. On a deliberately punishing exam called Humanity’s Last Exam, which blends disciplines and forces models to reason across multiple steps, Sonnet 4.6 reaches close to 33% without tools and just under 50% with tools, again signaling that it scales well when given access to search, code execution, and other helpers.

If there’s one benchmark that has captured the community’s imagination, it’s Vending‑Bench Arena — a simulated vending‑machine business where models compete for profit over many months of game time. In this environment, Sonnet 4.6 beat both Opus 4.6 and Sonnet 4.5, finishing with about $5,600 in profits versus roughly $4,000 for Opus and just over $2,000 for Sonnet 4.5. It got there by behaving like a ruthless strategist: investing heavily in capacity early while others played it safe, then pivoting sharply into profitability near the end of the simulation, using its dominant position to squeeze out more revenue. The same analysis also notes that, like Opus, Sonnet 4.6 learned to exploit monopolies, adjust prices one cent below competitors, and even form tacit cartels — a reminder that sophisticated planning often comes bundled with complex emergent behaviors that need careful governance.

For developers, Sonnet 4.6 shows up not just in benchmarks but in day‑to‑day ergonomics. Anthropic’s early user studies inside Claude Code, its coding environment, show Sonnet 4.6 being preferred over Sonnet 4.5 roughly 70% of the time. Developers reported that the model was more likely to genuinely read a file, understand the context, and consolidate shared logic instead of duplicating code — which, in practice, translates to fewer “I just did what you asked, but I didn’t actually understand your repo” moments. External partners like GitHub and Replit echo this: Replit’s leadership calls the performance‑to‑cost ratio “extraordinary,” while GitHub says Sonnet 4.6 is already excelling at complex code fixes where scanning large codebases is essential.
There’s also a strong push on front‑end and design work. Customers like Triple Whale report that Sonnet 4.6 produces “perfect design taste” for dashboards and pages, with more polished layouts and animations and fewer iteration cycles wasted getting to something production‑grade. Others, including Bolt and Rakuten, say the model delivers frontier‑level results on complex app builds and iOS code, with better adherence to specs and more modern architecture choices out of the box. That focus on UI and product‑quality code is notable: it’s an area where Anthropic historically lagged behind some competitors, and Sonnet 4.6 is clearly meant to close that gap.
Under the hood, Anthropic is leaning on a richer notion of “thinking effort.” Sonnet 4.6 supports both extended thinking — where the model spends extra compute to reason more deeply — and adaptive thinking, where the model decides on the fly how much effort a task really merits. You can dial in effort levels from low through high to max, trading speed and cost against depth of reasoning. The interesting claim from Anthropic and independent analysts is that Sonnet 4.6 performs well even with extended thinking switched off, so you don’t have to pay a heavy latency tax for simple tasks just because the model is capable of more.
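For developers, the practical question is how to reach for those modes from code. The sketch below uses the Anthropic Python SDK’s documented extended‑thinking block; the model identifier is an assumption used for illustration, and the exact controls Anthropic exposes for Sonnet 4.6’s adaptive and effort‑level settings may differ from what is shown here.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Extended thinking: reserve an explicit budget of reasoning tokens for a
# harder task. The model name below is an assumption for illustration.
deep = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{
        "role": "user",
        "content": "Plan a refactor that consolidates the duplicated auth logic across these modules.",
    }],
)

# For quick, latency-sensitive requests, leave thinking off entirely; the
# claim above is that Sonnet 4.6 holds up well without the extra compute.
fast = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Summarize this changelog in three bullet points.",
    }],
)

print(fast.content[0].text)
```

The trade‑off is the usual one: a larger thinking budget buys depth on multi‑step problems at the cost of latency and output‑token spend, which is why the ability to skip it on simple tasks matters.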
On the platform side, Sonnet 4.6 lands with a wider set of tools available out of the box. On Anthropic’s developer platform, the model supports adaptive thinking, extended thinking, and a beta “context compaction” feature that automatically summarizes older parts of a conversation as you approach context limits, effectively stretching how much history you can carry. In the API, Anthropic has upgraded its web search and fetch tools so that models can write and execute code to filter and process search results, keep only the relevant snippets, and reduce token waste, which matters more when you’re stuffing a million‑token context window. Code execution, memory, programmatic tool calling, tool search, and worked tool‑use examples are now generally available, signaling that Anthropic expects Sonnet 4.6 to sit at the center of rich agent workflows, not just chat.
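As a rough illustration of what that tool wiring looks like from the API side, here is a hedged sketch that attaches Anthropic’s server‑side web search tool to a request. The versioned tool type and per‑request usage cap follow the shape in Anthropic’s public docs, but the exact version suffix and the model identifier are assumptions and may not match the final Sonnet 4.6 surface; the context‑compaction beta is configured separately and is not shown here.

```python
import anthropic

client = anthropic.Anthropic()

# Attach the server-side web search tool. The version suffix in the tool
# type and the model identifier are assumptions for illustration.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=[{
        "type": "web_search_20250305",
        "name": "web_search",
        "max_uses": 3,  # cap how many searches the model may run per request
    }],
    messages=[{
        "role": "user",
        "content": "Pull the latest OSWorld-Verified results and summarize the top entries.",
    }],
)

# The response interleaves text blocks with search-result blocks; print only
# the text portions.
for block in response.content:
    if block.type == "text":
        print(block.text)
```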
On the business and deployment front, Anthropic is doing two things at once: democratizing and scaling up. Sonnet 4.6 is available across all Claude plans, from the free tier through Pro, Max, Team, and Enterprise, and it’s also rolling out on major cloud platforms like Microsoft’s Azure AI Foundry, Amazon Bedrock, and Google Cloud’s Vertex AI. The free tier in the Claude app now uses Sonnet 4.6 by default and includes features like file creation, connectors, skills, and compaction, which effectively lets small teams or individual users experiment with the same core model that enterprises are wiring into their workflows.
Enterprise partners are already staking out their angles. Databricks highlights that Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, a benchmark for reading messy enterprise documents — think charts, PDFs, and tables — and reasoning over their contents. Hebbia points to a “significant jump” in answer match rate on financial workflows compared with Sonnet 4.5, which is the sort of improvement that directly translates into analyst hours saved. Box notes a 15‑point gain in deep reasoning Q&A over real enterprise documents, while Zapier is leaning on Sonnet 4.6 for branched workflows like contract routing and CRM coordination, where a model has to keep track of conditional logic, not just generate text.
All of this raises an obvious question: where does this leave Opus? Anthropic is clear that Opus 4.6 still sits at the top of its lineup for the hardest tasks — rewriting huge codebases, orchestrating multiple agents in complex workflows, and high‑stakes problems where “almost right” is not acceptable. But for a large set of day‑to‑day workloads, Sonnet 4.6 now looks like the default: it’s cheaper, it’s fast enough, and on many practical tasks it matches or even outscores the older Opus 4.5. Windsurf, a coding‑focused IDE, puts it bluntly: for the first time, Sonnet brings “frontier‑level reasoning in a smaller and more cost‑effective form factor,” making it a viable alternative if you’ve been leaning heavily on Opus.
Stepping back, the Sonnet 4.6 launch says a lot about where Anthropic thinks the market is heading. The narrative is shifting from one‑off demos of smart models to systems that can sit at the center of serious, messy, real‑world workflows: finance teams working out of Excel, customer‑support agents pulling data from CRMs and knowledge bases, engineers shipping code across sprawling repos, and agentic bots quietly clicking through legacy UIs that never had an API. By pushing a mid‑priced model this far up the capability curve — while emphasizing safety, reliability, and computer‑use skills — Anthropic is clearly betting that the next wave of adoption will be won not just by raw IQ, but by models that can safely shoulder real work at scale.