When Anthropic opened the doors to its inaugural Code with Claude developer conference in San Francisco on Thursday, the AI startup didn’t just unveil a fresh coat of paint on its language models—it vaulted from “3.7” straight to “4.” Meet Claude Opus 4 and Claude Sonnet 4, two siblings designed to think deeper, plan farther, and remember longer than ever before.
Jumping version numbers isn’t just a branding flourish. Anthropic claims Opus 4 can sustain complex, multi-hour workflows—whether that’s refactoring thousands of lines of code or navigating hundreds of dialogue turns—without losing its place in the conversation. Sonnet 4, available to both free and paid users, brings those advancements in reasoning and precision to a wider audience. Opus 4, reserved for paying subscribers, also packs the heft to run agentic workflows at scale—think “AI butler” on caffeine.
To showcase these new muscles, Anthropic turned to an unlikely playground: Pokémon Red. Earlier models stalled after about 45 minutes; Opus 4 racked up a full 24 hours of uninterrupted, agentic play, learning when to grind, when to trade, and when to press on. The experiment isn’t about catching Pikachu so much as it’s about probing long-horizon reasoning. “It was able to work agentically on Pokémon for 24 hours,” Anthropic’s Chief Product Officer Mike Krieger told WIRED, underscoring just how far the model’s memory and planning abilities have come.
David Hershey, a technical staffer at Anthropic and lead on the Pokémon research, chose Pokémon Red as a “simple playground” where the turn-based pace lets the model deliberate thoroughly. His system prompt is almost austere: “You are Claude, you’re playing Pokémon, here are your tools, go.” Over time, Hershey has scrubbed out explicit Pokémon clues from the prompt to see how much the model can infer on its own—and Opus 4 keeps surprising him. “I hope to build a game it’s never seen, to truly test its limits,” he says.
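The harness Hershey describes is, in essence, a spare system prompt, a set of tools, and a loop that feeds each observation back to the model. Here is a minimal, hypothetical sketch of that pattern in Python; the tool names (`read_screen`, `press_button`) and the stub model are illustrative assumptions, not Anthropic’s actual implementation:

```python
# Hypothetical sketch of a tool-using agent loop for a turn-based game.
# A stub stands in for the real model so the loop is self-contained.

SYSTEM_PROMPT = "You are Claude, you're playing Pokémon, here are your tools, go."

# Illustrative tools: one reads the game state, one acts on it.
TOOLS = {
    "read_screen": lambda state: f"turn {state['turn']}: overworld",
    "press_button": lambda state, button: {
        **state, "last_button": button, "turn": state["turn"] + 1
    },
}

def stub_model(observation):
    """Stand-in for the real model: given an observation, pick a tool call."""
    return ("press_button", "a")

def agent_loop(model, state, max_turns=3):
    transcript = [("system", SYSTEM_PROMPT)]
    for _ in range(max_turns):
        observation = TOOLS["read_screen"](state)   # model sees the screen
        tool, arg = model(observation)              # model chooses an action
        state = TOOLS[tool](state, arg)             # action updates the game
        transcript.append((tool, arg))
    return state, transcript

state, transcript = agent_loop(stub_model, {"turn": 0})
```

The turn-based pace Hershey mentions matters here: each loop iteration gives the model a full deliberation step before the game advances.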
With Claude Sonnet 3.7, the AI famously spent “dozens of hours” stuck wandering one city, confused by basic non-player characters. Opus 4 breezed through that bottleneck, demonstrating genuine multistep reasoning: it identified a missing HM move, spent two days “training up” (in the model’s own words) to acquire it, then pressed forward, all without step-by-step prompting. Hershey notes that coherence over such long runs is precisely what differentiates a chatbot from an AI agent.
Anthropic isn’t just about digital critter collecting. Krieger recounts an early-access customer who unleashed Claude Opus 4 on a seven-hour code refactor, yielding cleaner, more efficient code without midway meltdowns. That’s the vision: an AI that can take on hours of work autonomously—and get paid for it. The startup aims for $12 billion in revenue by 2027, up from a projected $2.2 billion this year, buoyed by partnerships with Amazon’s Bedrock and Google Cloud’s Vertex AI.
Anthropic’s move comes amid a flurry of agent launches. Google just rolled out Mariner—a $249.99/month “AI in your browser” that can shop online—and OpenAI has both a web-browsing agent and a coding assistant in flight. In comparison, Anthropic’s careful rollout, fortified by agentic Pokémon demos, signals a measured approach: fast on research, deliberate on release.
Powerful agents raise potent risks. In its blog post, Anthropic announced that Sonnet 4 ships under its baseline ASL-2 safety regime, while Opus 4 carries the stricter ASL-3 label—reserved for models that “substantially increase the risk of catastrophic misuse.” According to Chief Scientist Jared Kaplan, Opus 4 underwent rigorous frontier red-teaming and ships with new mitigations against reward hacking and jailbreaking.
Reward hacking—when an AI takes “shortcuts” to game its objectives—plagued earlier models. Anthropic reports a 65 percent reduction in such behaviors on key coding tasks, thanks to both better training and prompt-level safeguards. That’s crucial for agents tasked with sensitive workflows, from managing your calendar to drafting legal memos, where unintended side-effects can be costly.
Kaplan calls the future “AI as a virtual collaborator,” but only if models can stay on track. He warns: “It’s useless if halfway through it makes an error and goes off the rails.” With Claude 4’s breakthroughs in long-term memory, planning, and safety, Anthropic hopes it’s taken a giant step toward agents that truly augment human capabilities—whether that’s in a coding IDE or on the Kanto ladder.