Anthropic’s new Claude Opus 4.7 feels less like a routine model refresh and more like Anthropic quietly saying: “Go ahead, give your ugliest, hairiest engineering problems to an AI – it can probably handle them now.” For software teams that already live in IDEs, CI/CD dashboards, and log streams all day, this release is aimed squarely at the work that used to be “too risky to fully trust a model with.”
Claude Opus 4.7 is positioned as Anthropic’s most capable generally available model, with a clear emphasis on advanced software engineering rather than broad, flashy consumer tricks. Anthropic says it is a “notable improvement” over Opus 4.6 on the hardest coding tasks, and early partner data backs that up: on demanding benchmarks like SWE-bench Pro and CursorBench, Opus 4.7 posts double-digit jumps that move it from “strong” to “best-in-class” territory for coding and agentic workflows.
The headline story here is autonomy. Anthropic’s own write-up highlights that teams are now confident handing off their hardest coding work – the kind that previously needed careful human babysitting – directly to Opus 4.7. Partners echo this in more concrete terms. Cursor reports a leap from 58 percent to 70 percent on CursorBench, its internal benchmark tuned for real coding work inside the IDE. Anthropic’s own numbers show SWE-bench Pro jumping from around 53 percent on Opus 4.6 to 64.3 percent on Opus 4.7, and SWE-bench Verified rising from 80.8 percent to 87.6 percent. That means the model is actually landing more real bug fixes and production-grade patches, not just writing plausible snippets.
Crucially, that lift isn’t just about raw intelligence; it’s about staying power. Anthropic and its partners repeatedly emphasize long-running, multi-step workflows: Opus 4.7 sticks with a complex task for hours, pushes through failures, and keeps executing across tools and systems instead of stalling out. Companies building agents and devtools describe it as a model that no longer needs step-by-step handholding. In Devin, a well-known autonomous coding agent, Opus 4.7 reportedly works “coherently for hours” and unlocks deeper investigation work that earlier models struggled to run reliably. In some cases, it’s not just fixing bugs but redesigning systems: partners mention the model autonomously building a full Rust text-to-speech engine from scratch – including neural model, optimized kernels, and a browser demo – and then piping its own output through a speech recognizer to validate it against a Python reference implementation.
That self-checking behavior is one of the more interesting qualitative shifts. Anthropic notes that Opus 4.7 now “devises ways to verify its own outputs before reporting back,” which is exactly the kind of capability teams want when they’re letting a model edit core infrastructure or financial pipelines. Vercel’s engineering leadership, for example, highlights that the model will sometimes effectively do proofs on systems code before it starts making changes, a behavior they say they hadn’t seen in prior Claude models. Other partners describe a noticeable drop in “wrapper noise” – fewer meaningless helper functions and scaffolding, and more focused, production-ready code.
If you zoom out, Opus 4.7’s coding improvements show up particularly starkly in the agentic and tooling-heavy scenarios that have become the industry’s obsession. Anthropic’s partners report that on a 93-task coding benchmark, Opus 4.7 improved its resolution rate by 13 percent over Opus 4.6 and solved four tasks that neither Opus 4.6 nor even Sonnet 4.6 could crack. On Rakuten’s internal SWE-bench implementation, the new model resolves roughly three times more production tasks than its predecessor. Notion says tool errors in their multi‑step workflows dropped to about one-third of previous rates, while still delivering a 14 percent bump in task success. Qodo and CodeRabbit, both focused on code review and quality, report double‑digit gains in recall with stable precision – in practice, that means the model is surfacing more real bugs without spamming engineers with false alarms.
There is also a subtle but important shift in how Opus 4.7 behaves as a collaborator. Several partners mention that the model “pushes back” more in technical discussions instead of eagerly agreeing with every suggestion. For senior engineers used to rubber‑stamp copilots, this is a meaningful change: the model is more willing to point out flaws in a design, insist on safer patterns, or decline to proceed when key details are missing. At Hex, a data-focused company, Opus 4.7 reportedly refuses to fabricate numbers when source data is absent, where earlier models were more likely to fill in the gaps with plausible-sounding guesses. That kind of behavior is unglamorous, but it’s exactly what you want when models are touching dashboards, financial reports, or legal documents.
Vision is another area where Opus 4.7 quietly takes a generational step. Anthropic has bumped the model’s image resolution limit to roughly 2,576 pixels on the long edge, about 3.75 megapixels – more than three times the fidelity of earlier Claude versions. Internal and partner benchmarks suggest that the payoff is real: in visual acuity tests used for autonomous penetration-testing workflows, Opus 4.7 jumps from around 54.5 percent to 98.5 percent, essentially eliminating their single biggest pain point in having the model “see” complex screens and interfaces. Life-sciences-focused customers report that it can now reliably read chemical structures and interpret dense technical diagrams, which are far from easy pattern‑recognition tasks.
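For teams wondering what that resolution bump means for their upload pipelines, the math is simple enough to sketch. The ~2,576-pixel long-edge figure below comes straight from the article; the exact limit, and how the API rounds when downscaling, are assumptions rather than a documented contract.

```python
# Sketch: pre-scale an image to fit a ~2,576 px long-edge limit before upload.
# The limit value is taken from the reported figure for Opus 4.7; the rounding
# behavior here is an illustrative assumption, not the API's actual mechanism.

def fit_to_long_edge(width: int, height: int, max_long_edge: int = 2576) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_long_edge."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already within the limit; no resampling needed
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)

# A 4K screenshot (3840x2160) would shrink to 2576x1449 - comfortably under
# the ~3.75-megapixel ceiling the article describes.
print(fit_to_long_edge(3840, 2160))
```

In practice this means many full-page screenshots and dense diagrams that previously had to be tiled or cropped can now be sent whole, which is presumably what drives the visual-acuity jump partners report.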
On the “knowledge work meets engineering” side, Opus 4.7 also shows strong gains. Anthropic calls out state-of-the-art performance on GDPval-AA, a third‑party benchmark that measures economically valuable work across finance, legal, and other domains. In practice, that translates into better document analysis, modeling, and presentation building. Financial firms testing the model say it now acts more like a junior analyst who can produce rigorous models and cross‑linked decks, rather than just summarizing PDFs. For legal workflows, companies like Harvey report over 90 percent substantive accuracy on their BigLaw Bench evaluation, with smarter handling of ambiguous editing tasks and fine‑grained contract clauses that historically trip up frontier models.
All of this capability comes with some new knobs for developers. Anthropic is introducing an “xhigh” effort level between “high” and “max,” meant to give teams finer control over the trade-off between latency and depth of reasoning on hard problems. In Claude Code, xhigh is now the default for all plans, and Anthropic explicitly recommends starting with high or xhigh for coding and agentic cases. The model also uses a new tokenizer that, according to Anthropic, maps many inputs to roughly 1.0–1.35 times as many tokens as before, while thinking more aggressively at higher effort levels, especially on later turns in long agent runs. That combination can increase token counts in some scenarios, but Anthropic’s internal tests suggest that overall efficiency – measured as solved tasks per token – actually improves for coding workflows.
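The tokenizer change is worth budgeting for before migrating. A minimal sketch of the bound, treating the article’s reported 1.0–1.35× expansion range as a uniform worst case (that uniform treatment is an assumption; real expansion will vary by content):

```python
import math

# Sketch: bound how the new tokenizer's reported 1.0-1.35x expansion could
# affect an existing prompt budget. The multiplier range comes from Anthropic's
# stated figures; applying it uniformly is a simplifying assumption.

def token_budget_after_migration(old_tokens: int, low: float = 1.0, high: float = 1.35) -> tuple[int, int]:
    """Return (best_case, worst_case) token counts after the tokenizer change."""
    return math.ceil(old_tokens * low), math.ceil(old_tokens * high)

# A prompt that used to cost 10,000 tokens could now cost up to 13,500:
print(token_budget_after_migration(10_000))
```

Sizing context windows and rate limits against the worst case avoids surprises on long agent runs, where the extra thinking at higher effort levels compounds the per-turn token growth.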
Developers also get “task budgets” in public beta, letting them set soft ceilings on token spend for long-running jobs and giving the model guidance on how to allocate its effort over time. For teams nervous about runaway agents racking up massive bills, this is a practical compromise: you still get the long-horizon reasoning, but within clear, enforceable bounds. Under the hood, the model is also better at using file-system-like memory, remembering key notes and artifacts across multi-session work so it can pick up new tasks without needing full context replayed every time.
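The same idea can be mirrored client-side while the beta matures. The sketch below is a local guard, not the API’s actual task-budget mechanism (the server-side feature presumably enforces ceilings natively, and its real parameters are not described in detail here):

```python
# Sketch of a client-side guard mirroring the "task budgets" idea: stop a
# long-running agent loop once cumulative token spend crosses a soft ceiling.
# This local version is illustrative; it is not the API's beta feature.

class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Accumulate the token usage reported for one model turn."""
        self.spent += input_tokens + output_tokens

    @property
    def exhausted(self) -> bool:
        return self.spent >= self.max_tokens

    @property
    def remaining(self) -> int:
        return max(0, self.max_tokens - self.spent)

# An agent loop would check budget.exhausted between turns and wind down
# gracefully instead of racking up an open-ended bill.
budget = TaskBudget(max_tokens=50_000)
budget.record(input_tokens=12_000, output_tokens=4_000)
print(budget.remaining, budget.exhausted)
```

Even once the native feature is generally available, a belt-and-suspenders local counter like this is cheap insurance for agents that fan out across multiple API calls.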
Pricing is deliberately boring, in a good way. Opus 4.7 comes in at the same rate as Opus 4.6 – $5 per million input tokens and $25 per million output tokens – and rolls out everywhere Anthropic already lives: the Claude app, the Claude API, Amazon Bedrock, Google Cloud’s Vertex AI, and Microsoft’s Foundry program. That means teams that already wired Opus 4.6 into their stack get an essentially drop-in upgrade, albeit with a few migration caveats. Anthropic warns that because instruction-following is now much stricter, prompts that relied on earlier models being a bit “loose” may produce surprising behavior; the model is more literal and less likely to quietly skip parts of a request. In other words, you may need to clean up vague or overloaded prompts that worked by accident before.
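Since the rates are unchanged, back-of-the-envelope cost estimates carry over directly. A quick sketch at the stated $5/$25 per-million-token prices (cache or batch discounts, if any apply, are not modeled here):

```python
# Sketch: estimate per-request cost at the published Opus 4.7 rates of
# $5 per million input tokens and $25 per million output tokens.
# Discounted tiers (caching, batching) are deliberately left out.

INPUT_PER_MTOK = 5.00
OUTPUT_PER_MTOK = 25.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for one request at the standard per-million-token rates."""
    return (input_tokens / 1_000_000) * INPUT_PER_MTOK + \
           (output_tokens / 1_000_000) * OUTPUT_PER_MTOK

# e.g. a heavy agent turn with 200k tokens in and 30k out costs about $1.75:
print(f"${estimate_cost(200_000, 30_000):.2f}")
```

Combined with the tokenizer’s possible 1.0–1.35× input expansion, that arithmetic is the fastest way to sanity-check whether an Opus 4.6 workload stays within budget after the swap.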
On safety and security, Opus 4.7 sits in a deliberate middle ground. Anthropic is clear that this model is less broadly capable than Claude Mythos Preview, its cutting-edge but tightly controlled cyber-focused model, which launched to select partners under the Project Glasswing banner. During training, Anthropic experimented with techniques to dial back Opus 4.7’s offensive cyber capabilities while still keeping it useful for legitimate security work. Out of the box, the model ships with automatic safeguards that block clearly prohibited or high‑risk cybersecurity use cases. For red teamers and security researchers who do need access to its full defensive capabilities, Anthropic is introducing a Cyber Verification Program to vet and onboard those use cases more carefully.
More broadly, Anthropic says Opus 4.7’s safety profile looks similar to Opus 4.6, with low rates of worrying behaviors like deception, sycophancy, or cooperation with misuse in their internal audits. On some fronts – honesty and resistance to prompt‑injection attacks – Opus 4.7 is described as a modest upgrade; on others, like giving overly detailed harm‑reduction advice around controlled substances, it is slightly weaker. Anthropic’s internal alignment verdict is that the model is “largely well-aligned and trustworthy, though not fully ideal,” with Mythos Preview still standing as their best-aligned system overall. For enterprises, that nuanced picture matters: this is a model you can point at real production workflows today, but it still requires the usual guardrails and policy work on top.
