GadgetBond
AI | Business | OpenAI | Tech

OpenAI chooses Cerebras for ultra-fast AI inference

Cerebras’ wafer-scale chips give OpenAI a new way to run AI at real-time speed.

By Shubham Sawarkar, Editor-in-Chief
Jan 14, 2026, 10:00 PM EST
OpenAI and Cerebras logos displayed side by side, separated by a vertical line, on a blue-green gradient background. Image: OpenAI

OpenAI’s partnership with Cerebras is essentially a bet that the future of AI will be real-time, always-on, and limited less by GPUs and more by electricity and cooling. It is about taking the kinds of models people already use every day and making them feel as responsive as a live conversation or a local app, even at massive global scale.

At the heart of the deal is a huge number: 750 megawatts of ultra-low-latency AI compute that Cerebras will dedicate to running OpenAI’s models. That capacity will be rolled out in multiple phases through 2028, making this one of the largest high-speed AI inference deployments announced so far. Unlike a typical GPU cluster stitched together from thousands of cards, Cerebras builds systems around a “wafer-scale engine” – a single, giant chip the size of an entire silicon wafer, with compute, memory, and bandwidth living side by side. By keeping everything on one enormous piece of silicon instead of hopping across a network of discrete accelerators, Cerebras cuts out many of the latency bottlenecks that slow traditional AI inference.

This is exactly the pain point OpenAI wants to address. Today, when a user asks a complicated question, generates code, or kicks off an AI agent, there is a multi-step dance behind the scenes: the request travels to a data center, the model runs across multiple machines, results are stitched together, and then streamed back. That process works, but it is not always instant, especially at peak demand or with dense workloads like code generation and long-form reasoning. OpenAI describes its overall compute strategy as building a “resilient portfolio” that matches different workloads to the hardware that makes the most sense for them, and Cerebras is being slotted in as a dedicated low-latency inference tier. In practical terms, that means certain classes of prompts – the ones where every millisecond matters for user experience – can be routed to this faster Cerebras-backed layer.
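
The idea of routing latency-sensitive requests to a dedicated tier can be sketched in a few lines. This is purely illustrative – the tier names, workload labels, and threshold are hypothetical, not details of OpenAI’s actual routing system:

```python
# Hypothetical sketch of workload-aware tier routing.
# Tier names, workload labels, and the 200 ms threshold are illustrative only.

LOW_LATENCY_WORKLOADS = {"code_completion", "live_translation", "voice_agent"}

def pick_inference_tier(workload: str, max_latency_ms: int) -> str:
    """Route a request to a hardware tier based on its latency budget."""
    if workload in LOW_LATENCY_WORKLOADS or max_latency_ms < 200:
        return "wafer-scale"   # dedicated low-latency tier
    return "gpu-cluster"       # default tier for batch-friendly work

# A live coding request goes to the fast tier; an overnight batch job does not.
print(pick_inference_tier("code_completion", 1000))   # wafer-scale
print(pick_inference_tier("batch_summarize", 5000))   # gpu-cluster
```

The point of the “resilient portfolio” framing is exactly this kind of dispatch: each request carries enough context for the serving layer to match it to the hardware that suits it.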

The companies have been circling each other for years. Cerebras has pitched itself as an alternative to GPU-bound AI infrastructure, claiming that its wafer-scale systems can run large language models at speeds up to an order of magnitude faster than conventional GPU setups for some workloads. Early benchmarks on Cerebras hardware, including models from the Llama family, show token generation rates that significantly outpace many GPU-based deployments, which is exactly the sort of improvement OpenAI needs for “always-on” assistants, live coding copilots, and real-time agents. For Cerebras, this deal is a validation moment: its CEO, Andrew Feldman, framed it as a decade-long journey culminating in a multi-year agreement that could push wafer-scale technology into the hands of hundreds of millions, and eventually billions, of users.
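
To see why token generation rate matters so much for user experience, a quick back-of-the-envelope calculation helps. The specific rates below are assumptions for illustration, not published benchmark numbers:

```python
# Illustrative arithmetic: how generation rate translates to waiting time.
# The 60 tok/s and 600 tok/s figures are assumed for the example, not benchmarks.

def stream_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a given generation rate."""
    return num_tokens / tokens_per_second

# A 600-token answer at an assumed ~60 tok/s vs. a 10x faster tier:
slow = stream_time_seconds(600, 60)    # 10.0 seconds
fast = stream_time_seconds(600, 600)   # 1.0 second
```

A ten-second wait reads as a slow remote service; a one-second response reads as a local app – which is the whole pitch behind an “order of magnitude faster” inference tier.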

There is also a bigger context here: OpenAI is quietly building out a vast physical footprint to feed its models’ hunger for power and cooling. The Cerebras announcement lands alongside a separate partnership with SB Energy, backed by SoftBank, that involves a $1 billion investment to build and operate a 1.2 gigawatt AI data center campus in Milam County, Texas, powered by new solar and battery storage. A gigawatt is enough electricity to power roughly three-quarters of a million US homes at any given moment, which gives a sense of the scale of the facilities needed to run next-generation AI systems. When you pair that kind of renewable-heavy power infrastructure with 750 MW of specialized inference hardware, you begin to see how seriously OpenAI is treating AI as critical infrastructure rather than just cloud software.
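
The “three-quarters of a million homes” figure checks out with simple arithmetic, assuming a typical US household consumes roughly 10,500 kWh per year (the exact average varies by year and region):

```python
# Sanity check on "a gigawatt powers ~750,000 US homes at any given moment."
# The ~10,500 kWh/year household figure is an assumption for illustration.

AVG_HOME_KWH_PER_YEAR = 10_500
HOURS_PER_YEAR = 8_760

avg_home_kw = AVG_HOME_KWH_PER_YEAR / HOURS_PER_YEAR   # ~1.2 kW average draw
homes_per_gigawatt = 1_000_000 / avg_home_kw           # ~800,000 homes per GW
```

That lands in the same ballpark as the article’s figure, and it scales linearly: the 1.2 GW Texas campus corresponds to roughly a million households’ worth of continuous draw.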

For users, many of these moves will show up in subtle ways before they’re obvious headline features. Interfaces that used to stutter or lag may start to feel “local” even when they are calling massive models over the network. Latency-sensitive use cases – think real-time customer support, multiplayer gaming assistants, trading and risk agents, or live language translation – stand to benefit the most from a dedicated low-latency tier. In that sense, Cerebras plays a similar role to the early broadband providers of the web era: most people will never see the wafer-scale chips themselves, but they will notice when AI stops feeling like a slow remote service and starts behaving like a native part of everything they do online.

Copyright © 2026 GadgetBond. All Rights Reserved.