Game Boy Pokémon Is Turning Into a Benchmark for AI Agents

The most advanced AI systems on the planet can write code, summarize legal documents, and explain complex science in seconds. Then you drop them into a pixelated 1990s videogame with Pikachu on the screen, and something strange happens. They don’t fail like a student who never studied. They fail like a worker who forgets the task midstream and keeps doing the same thing until someone steps in.
That failure mode is exactly why Pokémon has become a serious test. The Wall Street Journal reports that among top AI labs, Nintendo’s original Pokémon games are emerging as a way to track progress and figure out whether models can be deployed toward time-consuming, complex goals, not just quick answers.
The Twitch streams that turned Pokémon into a scoreboard
Anthropic’s “Claude Plays Pokémon” stream launched last February, built by David Hershey, an applied AI lead at Anthropic. His framing is plain: Pokémon lets you see how a model is doing and lets you “evaluate it in a quantitative way.”
The idea didn’t stay contained. Independent developers created “GPT Plays Pokémon” and “Gemini Plays Pokémon,” projects that, per the Journal, later received support from the labs themselves. The audience behavior is part of the signal, too: the streams draw huge live engagement as the models chart their progress in real time.
Inside the companies, it’s turned into a cultural flex. The Journal reports that OpenAI employees even kept a live GPT Pokémon stream running on an office TV for a time. And after Gemini beat the game, TechCrunch noted that Sundar Pichai celebrated publicly.
Why Pokémon stresses agents better than quiz-style benchmarks
Pokémon isn’t hard because the rules are difficult to memorize. It’s hard because it forces the exact things people want from agents in the real world.
You have to hold a goal in your head across hours. You have to navigate mazes and puzzles that punish mindless trial-and-error. You have to make decisions that only pay off later, like whether to train your current Pokémon or catch new ones. The WSJ quotes Carnegie Mellon professor Graham Neubig making the point cleanly: most benchmarks grade single answers, while Pokémon tracks reasoning and progress toward a goal over a long period.
TIME makes the same gap feel more human: these models often “know” what should happen, but their execution gets inconsistent over long sessions.
The failure mode is looping, not ignorance
A lot of people say these models “get stuck,” but that’s too gentle.
The failure mode is looping. The agent enters a room, leaves, forgets why it left, re-enters, and repeats the cycle for hours. In a game, looping is funny. In a business, looping is a financial leak, because the agent burns API credits and compute budget the entire time and delivers nothing back.
If you want a real-world analogy, imagine an AI employee that drafts the same email 200 times because it lost the goal state but kept running. Pokémon is that pattern, rendered in pixels.
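The looping pattern described above is also detectable mechanically. As a minimal sketch, assuming the harness can summarize each step into a hashable state such as a map ID plus player coordinates (the `LoopDetector` name, window size, and threshold are illustrative, not taken from any of the projects mentioned):

```python
from collections import deque

class LoopDetector:
    """Flag when an agent keeps revisiting the same state.

    A state can be any hashable summary of the game (e.g. map ID plus
    player coordinates). If the same state appears too many times within
    a sliding window of recent steps, the agent is probably looping and
    should stop to replan instead of burning more compute.
    """

    def __init__(self, window: int = 50, threshold: int = 5):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # drops oldest states automatically

    def observe(self, state) -> bool:
        """Record one step; return True if this state now looks like a loop."""
        self.recent.append(state)
        return self.recent.count(state) >= self.threshold
```

A harness could call `observe()` once per turn and, when it returns `True`, inject a message telling the model to summarize what it has tried and pick a different plan.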
The harness problem is the real story
Most readers hear “AI plays Pokémon” and picture a model looking at the screen like a human, interpreting pixels, and pressing buttons.
In practice, many setups don’t rely on pixels alone. The announcement of Gemini’s milestone explicitly said it took “a little help.” Ars Technica went deeper on why that matters, describing how a harness can feed the model extra representations of the game state to aid navigation.
Some builds go further by feeding highly structured navigation support. Joel Zhang’s developer write-up describes a harness that can analyze map data and provide coordinates for reachable unexplored tiles.
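Zhang’s write-up doesn’t publish the harness code, but the kind of support he describes can be sketched as a breadth-first search over a tile map; the grid encoding and function name below are assumptions for illustration:

```python
from collections import deque

def reachable_unexplored(grid, start, explored):
    """Return coordinates of walkable tiles the agent can reach but
    hasn't visited yet.

    grid: 2D list where 0 = walkable and 1 = wall.
    start: (row, col) of the player.
    explored: set of (row, col) tiles already visited.
    """
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    frontier = deque([start])
    targets = []
    while frontier:
        r, c = frontier.popleft()
        if (r, c) not in explored:
            targets.append((r, c))
        # Expand to the four adjacent tiles, skipping walls and repeats.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return targets
```

Handing the model a short list of such coordinates, instead of asking it to infer them from pixels, is exactly the kind of scaffolding that makes harness comparisons tricky.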
Hershey built a memory system so Claude could keep track of important information it learned while playing; he now shares best practices from that work with customers.
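The reporting doesn’t detail how Anthropic’s memory system works, but the core idea, persistent notes that outlive the model’s limited context, can be sketched as a keyed scratchpad (all names here are hypothetical, not Anthropic’s API):

```python
class Scratchpad:
    """Minimal persistent notes for a long-horizon agent.

    Notes are keyed so the agent can overwrite stale facts ("current
    goal") instead of appending forever and drowning in old context.
    """

    def __init__(self):
        self.notes = {}

    def write(self, key: str, value: str) -> None:
        """Store or overwrite one fact under a stable key."""
        self.notes[key] = value

    def render(self) -> str:
        """Render all notes as text to prepend to the model's next prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.notes.items())
```

The overwrite-by-key design matters: an agent that only appends notes eventually re-reads its own contradictory history, which is one way looping starts.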
Games have always tested AI, but Pokémon hit a nerve
Using games to evaluate AI isn’t new. Poker, chess, Go, and more recently games like Minecraft have all been used as testbeds, and the WSJ points to Pokémon as the one that captured the current moment.
If you want the “this is becoming official” signal, look at Kaggle. In August 2025, Kaggle launched Game Arena, kicking off with an AI chess exhibition tournament that its own blog framed as the inaugural event; Chess.com’s coverage reported that OpenAI’s o3 won.
One detail changes the leaderboard conversation
Newer Claude versions have improved, and Claude Opus 4.5 is still working through the game live on Twitch. But comparisons are tricky for a simple reason: both GPT and Gemini have already beaten the original Pokémon game, and the gap may owe as much to the harness as to the underlying model.
According to Joel Zhang and Jonathan Verron, the freelance developers behind the Gemini and GPT streams, Google and OpenAI models have since moved on to Pokémon sequels. Verron’s quote is basically the thesis: Pokémon is “a perfect game for AI right now.”
What to watch next
If you want the real signal, don’t obsess over who finishes first. Watch what keeps breaking.
Does the agent recognize it is looping and reset its plan? Does it write things down and actually use them? Does it recover from a setback without wandering for an hour? Does it stay aligned with the goal after a long detour?
Pokémon became the benchmark because it exposes the one trait that separates a flashy demo from a trustworthy worker: reliable follow-through.
Y. Anush Reddy is a contributor to this blog.



