OpenAI Signs $10B Deal With Cerebras for AI Speed

The pause is becoming the product.
As AI models get “smarter,” they can feel slower in the moments people care about most: when the system has to reason, write a long answer, generate code, or run an AI agent. OpenAI’s newest infrastructure move is meant to make that wait feel like it disappears.
OpenAI confirmed a multi-year deal to add 750 megawatts (MW) of ultra-low-latency compute from Cerebras, delivered in stages through 2028. A source familiar with the contract told Reuters the agreement is worth more than $10 billion over its life.
Why 750MW matters in OpenAI’s Stargate roadmap
Power math that changes the story: 750MW is 0.75GW, or roughly 7.5% of Stargate’s 10GW commitment.
OpenAI has described Stargate as a 10-gigawatt commitment (the formal, announced buildout). Against that official target, 0.75GW is a big chunk, and it’s being brought online on a fast schedule, not “someday.”
There’s a bigger backdrop, too: Reuters has also reported Sam Altman publicly discussed an even larger 30GW ambition over the long run. But the number that matters for the near-term roadmap is 10GW, because that’s the committed plan.
This isn’t just “cloud capacity”; it’s a physical buildout
This deal is being treated like real infrastructure, not a simple API subscription.
The Register reports Cerebras will take on the risk of building and leasing data centers specifically to serve OpenAI under the agreement. That supports the strategic read: OpenAI isn’t only buying compute, it’s carving out a dedicated, high-speed inference lane that can sit alongside traditional hyperscaler capacity.
And yes, the Microsoft angle is part of the subtext. OpenAI still uses a broad mix of infrastructure partners, but a dedicated inference lane helps reduce compute dependency on any single provider for the specific workloads that define the ChatGPT experience, such as fast agent loops, real-time “reasoning,” and interactive coding.
The “wait time” problem: thinking speed vs. writing speed
When people say a model is “slow,” they usually mean one of two delays:
Thinking speed (time-to-first-token): How long before the first word appears.
Writing speed (tokens per second): How fast it outputs once it starts.
For reasoning models and agents, the most frustrating part is often the first one: the system is planning steps, checking itself, and choosing tools before it prints anything. OpenAI hasn’t published a clean split showing how much Cerebras improves “first token” versus “tokens per second,” but “ultra-low-latency” inference is aimed at making both feel faster: less dead air up front, and faster completion once output starts.
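To make that split concrete, here’s a minimal sketch of how you could measure the two delays yourself, assuming the standard OpenAI Python SDK with streaming enabled. The model name is a placeholder, and streamed chunks are used as a rough stand-in for tokens.

```python
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

start = time.perf_counter()
first_token_at = None
chunk_count = 0

# Stream a chat completion and note when the first piece of text arrives.
stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; any streaming chat model works
    messages=[{"role": "user", "content": "Explain wafer-scale chips in three sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # "thinking speed" ends here
        chunk_count += 1  # streamed chunks as a rough stand-in for tokens

end = time.perf_counter()
ttft = first_token_at - start               # time-to-first-token
tps = chunk_count / (end - first_token_at)  # rough "writing speed"
print(f"time to first token: {ttft:.2f}s, writing speed: ~{tps:.0f} chunks/s")
```

Run against a reasoning-heavy prompt, the first number is usually the one users feel most, which is exactly the delay this deal is meant to attack.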
Why Cerebras can feel faster: the wafer-scale idea
Most AI today runs on GPU clusters: many chips working together. That’s powerful, but it often means shuffling data between chips and memory, which can add delay. Cerebras’ approach is different: it builds systems that keep compute and memory close together on one giant piece of silicon (a wafer-scale chip), so less time is spent moving data around.
In simple terms:
GPU clusters: Fast, but lots of “passing notes” between chips.
Wafer-scale: Fewer hand-offs, which makes inference feel more immediate.
This is why Cerebras emphasizes tokens per second (TPS). It has claimed up to ~3,000 TPS on optimized architectures, a peak figure that, while naturally lower for massive reasoning-class models, still sets a new ceiling for “real-time” AI responsiveness.
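The writing-speed side is easy to put in perspective. A back-of-envelope sketch, using the ~3,000 TPS claim cited above alongside an assumed 100 TPS baseline and an assumed answer length, shows why the ceiling matters:

```python
# Back-of-envelope: how writing speed (TPS) changes the perceived wait.
# The ~3,000 TPS figure is Cerebras' claimed peak cited above; the 100 TPS
# baseline and the 1,200-token answer length are illustrative assumptions.
ANSWER_TOKENS = 1_200  # roughly a long coding or reasoning answer

for label, tps in [("assumed GPU-serving baseline", 100), ("Cerebras claimed peak", 3_000)]:
    seconds = ANSWER_TOKENS / tps
    print(f"{label:>28}: ~{seconds:.1f}s to write {ANSWER_TOKENS} tokens")

# ~12.0s vs ~0.4s of writing time for the same answer,
# before any time-to-first-token is added on top.
```

Under those assumptions, the same long answer goes from feeling like a download to feeling instantaneous.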
Why gpt-oss-120B mattered (and why this deal was the next step)
A fair question for 2026 readers: why is OpenAI running an open-weight model on Cerebras at all?
Because gpt-oss-120B was the proof point, and it wasn’t lightweight. OpenAI positioned gpt-oss-120B as a heavyweight reasoning model with strong benchmarks. Cerebras then used it as the showcase for what “real-time” inference looks like, claiming 3,000 TPS performance on its inference cloud.
That’s the connective tissue. If Cerebras could handle a complex architecture like gpt-oss-120B at extreme speed, the obvious next step was to scale the lane and use it to accelerate the “think then respond” loops that sit behind ChatGPT’s most valuable experiences.
The business drama: G42 and the “IPO-ready” revenue story
This is also a major reset for Cerebras’ business narrative.
The OpenAI contract helps Cerebras diversify away from G42, which accounted for 87% of its revenue in early 2024. The timing is critical: Reuters reported in December 2025 that Cerebras was targeting a Q2 2026 IPO.
A $10B+ OpenAI deal doesn’t just add credibility; it effectively anchors the revenue story heading into a public listing, moving the company from “startup with one big customer” to a major supplier powering OpenAI’s inference scale.
Bottom line: OpenAI isn’t only scaling compute; it’s scaling responsiveness. The 750MW number is the headline, but the real bet is that wafer-scale inference can make reasoning feel fast enough that users stop noticing the pause at all.
Y. Anush Reddy is a contributor to this blog.



