Apple Just Called BS on AI “Reasoning”: Here’s Why

You’ve probably seen the demo that looks like magic: a model writes a full page of "thoughts," and then nails the answer.
But Apple walked in with a bucket of cold water to ask a blunt question: is that real thinking, or just a staged performance?
First, let's look at the setup. Apple didn't rely on the usual brain teasers and academic benchmarks that models have often already seen in training. Instead, their team built small, controllable puzzle worlds (think Tower of Hanoi, river crossings, or blocks-style planning) where they could crank up the difficulty one click at a time. They didn't just check the final answer; they checked every step along the way and kept the playing field completely fair between models. In short, their goal was to measure the work, not just the show.
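To make "measure the work, not just the show" concrete, here is a minimal sketch, in Python, of what step-by-step checking could look like on Tower of Hanoi. It is an illustration of the idea, not Apple's actual evaluation harness: difficulty is simply the number of disks, and every proposed move gets validated against the rules instead of only grading the final state.

```python
# Hypothetical checker (not Apple's code): validates a Tower of Hanoi move list
# step by step instead of only grading the final answer.

def validate_moves(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    # Pegs 0, 1, 2; all disks start on peg 0, largest (n_disks) at the bottom.
    pegs = [list(range(n_disks, 0, -1)), [], []]

    for step, (src, dst) in enumerate(moves, start=1):
        if not pegs[src]:
            print(f"step {step}: peg {src} is empty")  # illegal: nothing to move
            return False
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            print(f"step {step}: disk {disk} placed on a smaller disk")  # illegal placement
            return False
        pegs[dst].append(pegs[src].pop())

    # Solved only if every disk ended up on peg 2.
    return len(pegs[2]) == n_disks


# The optimal solution takes 2**n - 1 moves, so "one click" of difficulty
# (one more disk) doubles the length of a fully written-out answer.
print(validate_moves(2, [(0, 1), (0, 2), (1, 2)]))  # True
```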
And here’s what they found. As the puzzles get harder, accuracy doesn't just fade; past a certain complexity it collapses to essentially zero. The performance curve doesn't gently slope; it snaps. That headline traveled fast for a reason.
Then came the twist. Right when the problems get really gnarly, the amount of text these models write to "think" actually goes down, even though they still have plenty of token budget left to spend. It looks like they're giving up early, precisely when grit should matter most.
Apple also mapped out the territory pretty clearly. On easy problems, simpler models often win. In the middle zone, the "think out loud" approach helps. But at the hard end of the spectrum, everything cracks. Thinking longer isn't a magic ladder; it's a tool that has a specific comfort zone.
But the story doesn't end there. A response paper fired back, arguing that the cliff isn't a failure of AI "thinking" at all, but a failure of the test itself. If you force a model to print out hundreds of step-by-step moves, you might just be testing how much it can write before running out of room, not how well it can reason. When the models were allowed to answer in a compact way (for example, by writing a tiny program that prints the moves instead of listing them by hand), performance on the hard puzzles rebounded. This shifted the whole debate from "Can they think?" to "Are we even testing the right thing?"
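To see why the answer format matters, here is the kind of "tiny program" the rebuttal had in mind, sketched in Python (the rebuttal's actual prompts and language may differ): a recursive generator a few lines long that emits every one of the 2^n − 1 moves, so a model that can write it has arguably solved the puzzle without having to type the moves out one by one.

```python
# A compact answer format: instead of listing 2**n - 1 moves, emit a short
# generator that produces them. (Illustrative only; not the rebuttal's exact code.)

def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C"):
    """Yield the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return
    yield from hanoi(n - 1, src, dst, aux)   # park the top n-1 disks on the spare peg
    yield (src, dst)                         # move the largest disk to the target
    yield from hanoi(n - 1, aux, src, dst)   # stack the n-1 disks back on top of it

# 10 disks: the explicit answer is 1,023 moves; the program above stays a few lines.
print(sum(1 for _ in hanoi(10)))  # 1023
```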
So, what's the honest takeaway once the noise dies down? Apple showed that the appearance of reasoning is easy to stage and just as easy to break, and that you only see the cracks when you measure real complexity and check the work. The rebuttal reminds us to separate a model's inability to think from its inability to write out an incredibly long answer. Both of these truths can exist in the same room.
In the end, this isn't a eulogy for AI. It's a much-needed reset. It asks us to treat the "think-out-loud" transcript as stage lighting, not the performance itself. Turn the lights all the way up, with clean tasks, fair comparisons, and verified steps, and the show looks very different. That isn't bad news. It's the beginning of better tests, clearer claims, and models we can actually learn to trust.
Further reading: You can find Apple’s research page and paper, the comment arguing the test design caused the cliff, and the mainstream coverage that captured the initial shock.
Y. Anush Reddy
Y. Anush Reddy is a contributor to this blog.