Claude Opus 4.7 Is Better. That's Exactly the Problem.

Anthropic's latest Opus put up stronger coding scores and sharper evals. The backlash says the real question was never whether it improved, but what kind of improvement people were actually getting.
At first, Opus 4.7 got the standard frontier-model rollout. Charts, gains, partner praise, the whole reassuring sequence that tries to close the story before most people have touched the product. Anthropic had real numbers. SWE-bench Pro jumped from 53.4 to 64.3. But the number that actually said something was the one that went the other way. BrowseComp, the benchmark closest to general-purpose web research, slipped from 83.7 to 79.3. Not a catastrophic drop. Just a direction.
Users do not meet a model through a benchmark table. They meet it through drag.
Anthropic's migration and prompting docs make that drag easier to read. Opus 4.7 is described as more direct than 4.6, more opinionated, less warm. Fewer emoji. Less likely to generalize from one example to the next unless you spell it out. It also tends to use fewer subagents by default — less initiative, basically.
The reasoning layer changed too, in a way that put more distance between the user and whatever the model was actually doing.
Claude 4 models now return a summarized version of their thinking by default rather than the full trace, and on Opus 4.7, adaptive thinking is the only supported mode. All that together makes it sound like a personality change rather than a software update.
The most measurable part is the tokenizer. Anthropic says the same prompt can now consume up to 35 percent more tokens than before: no pricing adjustment, no prominent warning, just a quiet change that shows up on the bill. That's why cost was the first complaint and the loudest one. But the tokenizer was only the most visible piece. The literal instruction-following is why rewrites started landing slightly off: less filling in, less reading between the lines, exactly what you asked for and nothing extra. The hidden reasoning is why some users felt like they were getting answers from behind glass rather than working alongside something. None of it was accidental. It was a different set of priorities, shipped without much explanation.
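To make the 35 percent figure concrete, here is a minimal back-of-the-envelope sketch of what that expansion does to an input-token bill when the price sheet stays put. The per-million-token price and the workload numbers are hypothetical, chosen only for illustration; the one figure taken from Anthropic's docs is the up-to-35-percent expansion.

```python
# Back-of-the-envelope input-cost impact of a tokenizer that expands prompts by up to 35%.
# PRICE_PER_MTOK and the workload are hypothetical, illustrative numbers, not Anthropic's rates.
PRICE_PER_MTOK = 15.00   # assumed $ per million input tokens
EXPANSION = 1.35         # worst-case expansion cited in the migration docs

def monthly_input_cost(prompts_per_day: int, tokens_per_prompt: int, days: int = 30) -> float:
    """Input-token spend for a fixed workload at the assumed rate."""
    total_tokens = prompts_per_day * tokens_per_prompt * days
    return total_tokens / 1_000_000 * PRICE_PER_MTOK

before = monthly_input_cost(prompts_per_day=5_000, tokens_per_prompt=2_000)
after = monthly_input_cost(prompts_per_day=5_000, tokens_per_prompt=int(2_000 * EXPANSION))
print(f"before: ${before:,.0f}/mo   after: ${after:,.0f}/mo   delta: +{after / before - 1:.0%}")
# Same workload, same price per token: roughly 35 percent more spend, purely from tokenization.
```

The point is not the specific dollar figures; it is that the expansion flows straight through to the bill when the price per token does not move.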
Anthropic's own launch post makes it explicit, maybe without meaning to. The company states that Opus 4.7 is not its most capable model — that's Mythos Preview, which has stronger cyber capabilities and is being kept on limited release while Anthropic tests new safeguards on less capable models first. Anthropic even says it actively worked to reduce some of those cyber capabilities during Opus 4.7's training.
The broad public model is not the best one Anthropic can ship. It is the one the company is more comfortable managing.
So, not a smarter Claude but a more legible one, built for enterprise workflows that want rigor, predictability, fewer surprises. And Anthropic isn't alone in this. The same tension has been showing up across the frontier — labs posting stronger eval numbers while a quieter set of complaints builds around cost, texture, and whether the model actually feels useful in practice.
Opus 4.7 may be the clearest sign yet that those two scoreboards have stopped tracking each other.
Y. Anush Reddy is a contributor to this blog.



