
NVIDIA Vera Rubin POD unites seven chips into one AI powerhouse

Rather than overprovisioning power, the MGX-based Vera Rubin POD dynamically shifts budgets between racks to unlock more GPUs in the same facility.

By Shubham Sawarkar, Editor-in-Chief
Mar 17, 2026, 5:12 AM EDT
Rows of gold-and-black NVIDIA Vera Rubin rack systems lined up in a dark data center, emphasizing the scale of the hardware. Image: NVIDIA

NVIDIA’s new Vera Rubin POD is what happens when you stop thinking in “servers and GPUs” and start thinking in “factories that mint tokens.” It is built on the third-generation NVIDIA MGX rack architecture, and the whole thing has been co-designed from grid to chip for a world where AI agents talk more to each other than to us.

At the heart of it is a simple shift: AI is now a token business. Every prompt, tool call, planning step, and agent hand-off is a torrent of tokens, and NVIDIA is openly designing around one metric—how many useful tokens you can squeeze out per watt and per rack. Over the past year, global token generation has blown past 10 quadrillion per year, and NVIDIA’s bet is that most of the next wave will be AI-to-AI, not human-to-AI. That means more reasoning tokens, bigger KV caches, and a constant need for CPU sandboxes where agents can safely execute and validate code before anything hits production systems.

The Vera Rubin POD is NVIDIA’s answer: 40 racks acting as one AI supercomputer, built from seven chips and five purpose-built rack-scale systems that are meant to be deployed as a single cohesive platform rather than a pile of parts. Those seven chips span compute, networking, storage, and data movement—Rubin GPUs, Vera CPUs, NVLink 6 switches, ConnectX‑9 SuperNICs, BlueField‑4 DPUs, Spectrum‑6 Ethernet switches, and Groq 3 LPUs—tied together so tightly that they behave less like a cluster and more like a monolithic machine. In total, a fully built Vera Rubin POD packs roughly 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, 1,152 Rubin GPUs, about 60 exaflops of AI performance, and 10 PB/s of total scale‑up bandwidth.
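
If you want to see how those headline numbers hang together, a quick back-of-envelope helps. The per-GPU and per-die figures below are our own division of NVIDIA's totals, not specs the company quotes:

```python
# Back-of-envelope check on the Vera Rubin POD headline figures quoted above.
# The derived per-GPU and per-die numbers are our own arithmetic, not NVIDIA specs.

pod_ai_exaflops = 60          # ~60 EF of AI performance for the full POD (precision unspecified)
pod_rubin_gpus = 1_152        # Rubin GPUs in a fully built POD
pod_transistors = 1.2e15      # ~1.2 quadrillion transistors
pod_dies = 20_000             # nearly 20,000 NVIDIA dies

ai_petaflops_per_gpu = pod_ai_exaflops * 1_000 / pod_rubin_gpus
transistors_per_die = pod_transistors / pod_dies

print(f"~{ai_petaflops_per_gpu:.0f} PFLOPS of AI compute per Rubin GPU (POD total / GPU count)")
print(f"~{transistors_per_die / 1e9:.0f} billion transistors per die, averaged across all die types")
```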

The five Vera Rubin rack-scale systems, each stacked inside tall MGX racks: NVL72, Groq 3 LPX, Vera CPU, BlueField-4 STX storage, and Spectrum-6 SPX. Image: NVIDIA
NVIDIA splits this supercomputer into five distinct rack-scale systems, each tuned for a different pain point in modern agentic AI. First is Vera Rubin NVL72, the flagship rack that NVIDIA likes to describe as “one giant GPU.” Inside a single third‑gen MGX rack, it integrates 72 Rubin GPUs and 36 Vera CPUs, all stitched together via a massive NVLink 6 copper spine. NVLink delivers 3.6TB/s per GPU and 260TB/s of scale‑up bandwidth per rack—more than the entire global internet’s bandwidth flowing behind a single row of trays. This is the engine that handles the four “scaling laws” NVIDIA cares about now: pretraining, post‑training, test‑time scaling, and agentic scaling, especially for massive mixture‑of‑experts models.
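
That rack-level figure is really just the per-GPU NVLink number multiplied out; a tiny sanity check using only the figures above makes the relationship obvious:

```python
# How the quoted per-rack scale-up bandwidth follows from the per-GPU NVLink figure.
# Simple arithmetic over the numbers in the article; not an NVIDIA sizing tool.

gpus_per_nvl72_rack = 72
nvlink6_tb_per_gpu = 3.6      # TB/s of NVLink 6 bandwidth per Rubin GPU

rack_scale_up_tb = gpus_per_nvl72_rack * nvlink6_tb_per_gpu
print(f"{gpus_per_nvl72_rack} GPUs x {nvlink6_tb_per_gpu} TB/s ≈ {rack_scale_up_tb:.0f} TB/s per rack")
# -> about 259 TB/s, in line with the ~260 TB/s of scale-up bandwidth NVIDIA quotes.
```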

Compared to the Blackwell generation, Rubin is designed to be brutally efficient. NVIDIA is positioning Rubin NVL72 as delivering up to 4x better training performance and up to 10x better inference performance per watt, with roughly one‑tenth the token cost for MoE inference versus Blackwell-based NVL72 systems. For big MoE models, Rubin can train with about one‑fourth the number of GPUs Blackwell needs in the same timeframe, which is the kind of ratio CFOs and cloud operators actually care about.

The tray hardware that makes all this work has been heavily rethought. Vera Rubin compute trays use a PCB midplane in a single‑wide MGX rack, allowing NVIDIA to go cable‑free, hose‑free, and fanless on the trays themselves. That sounds minor until you look at operations: assembly time drops from nearly two hours to around five minutes per compute tray, a 20x speed‑up that matters a lot when you’re rolling out hundreds or thousands of racks. Each tray holds two Vera Rubin superchips, and each superchip carries roughly 17,000 components—about five times what you’d find on a modern smartphone board.

Then there’s the next level of scale: Vera Rubin Ultra NVL576. Here, NVIDIA links eight MGX NVL racks in an all‑to‑all NVLink topology so that up to 576 GPUs live in a single NVLink domain. This topology has already been prototyped internally as “Polyphe” using GB200, and Rubin Ultra essentially pushes the same idea forward with next‑gen silicon and copper plus direct optical links between racks. Looking even further out, NVIDIA is teasing “Kyber” NVL1152—a next‑generation MGX rack that doubles the NVLink domain per rack to 144 GPUs and then scales to 1,152 GPUs across eight racks. That’s the chassis foundation for their future Feynman platform, and it’s meant to keep the “one system” illusion intact even as GPU counts go fully absurd.
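
The naming gets easier to parse once you multiply out the rack counts; here is the same arithmetic in a few lines of Python, using only the figures above:

```python
# NVLink domain sizes implied by the rack counts described above.
# Plain arithmetic on the article's figures.

configs = {
    "Vera Rubin NVL72":        {"racks": 1, "gpus_per_rack": 72},
    "Vera Rubin Ultra NVL576": {"racks": 8, "gpus_per_rack": 72},
    "Kyber NVL1152":           {"racks": 8, "gpus_per_rack": 144},  # next-gen MGX doubles GPUs per rack
}

for name, cfg in configs.items():
    domain = cfg["racks"] * cfg["gpus_per_rack"]
    print(f"{name}: {cfg['racks']} rack(s) x {cfg['gpus_per_rack']} GPUs = {domain} GPUs in one NVLink domain")
```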

If NVL72 is the brawn, then the NVIDIA Groq 3 LPX racks are the reflexes. These racks are dedicated inference accelerators built around 256 Groq 3 language processing units per rack, optimized for ultra‑low‑latency token generation at long context lengths. They use a direct chip‑to‑chip copper spine—thousands of point‑to‑point connections across 32 compute trays—to make the entire rack behave like a single, extremely fast inference engine. Paired with Rubin GPUs (with their huge HBM stacks), LPX racks effectively remove the traditional trade‑off between speed and throughput: NVIDIA claims up to 35x more tokens and up to 10x more revenue opportunity for trillion‑parameter models versus Blackwell-based systems, at the same or lower latency.

Agentic AI doesn’t run only on GPUs and LPUs, though. It also needs a huge amount of CPU horsepower for sandboxing—running tools, simulations, and reinforcement learning environments safely at scale. That is where the Vera CPU racks come in. A single Vera CPU rack can host up to 256 Vera CPUs in a dense, liquid‑cooled configuration, enough to sustain more than 22,500 concurrent reinforcement learning or agent sandbox environments. NVIDIA says these racks deliver twice the efficiency and up to 50% faster performance than traditional CPU racks designed the old‑fashioned way, which makes them a kind of “agent test track” sitting right next to the main GPU engines.
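
Divide those two figures and you get a rough sandbox density per CPU; the per-CPU number below is our own arithmetic, not something NVIDIA publishes:

```python
# Rough per-CPU sandbox density implied by the Vera CPU rack figures above.

vera_cpus_per_rack = 256
concurrent_sandboxes_per_rack = 22_500   # RL / agent sandbox environments per rack

sandboxes_per_cpu = concurrent_sandboxes_per_rack / vera_cpus_per_rack
print(f"~{sandboxes_per_cpu:.0f} concurrent sandbox environments per Vera CPU")
# -> roughly 88 environments per CPU, which is why NVIDIA pitches these racks
#    as an "agent test track" sitting next to the GPU engines.
```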

Storage gets the same AI‑first treatment with BlueField‑4 STX racks. These are based on the BlueField‑4 processor, which combines a Vera CPU with a ConnectX‑9 SuperNIC, and then scales out using Spectrum‑X Ethernet. The key feature here is NVIDIA CMX, which NVIDIA describes as an AI‑native “context memory” platform: think of it as a KV cache warehouse designed for LLMs. Instead of treating KV cache as a purely in‑GPU concept, CMX offloads and shares it across the POD, turning temporary inference context into a first‑class data type that can be reused across turns, sessions, and even different agents. The payoff is big: NVIDIA claims up to 5x more tokens per second and up to 5x better power efficiency versus traditional storage setups that weren’t built with KV cache in mind.
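
To make the concept concrete, here is a deliberately toy Python sketch of what an offloaded, shared KV-cache store looks like in principle. It is not NVIDIA's CMX API; every class and method name here is hypothetical, and a real system would move tensors over the fabric, not Python objects:

```python
# Purely illustrative sketch of the *idea* behind a shared, offloaded KV-cache store.
# Hypothetical names throughout; NOT NVIDIA's CMX interface.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ContextMemoryStore:
    """Toy 'KV cache warehouse': context keyed by (model, prompt-prefix hash)."""
    _store: dict[tuple[str, str], Any] = field(default_factory=dict)

    def put(self, model: str, prefix_hash: str, kv_cache: Any) -> None:
        # Offload the prefix's KV cache so other turns, sessions, or agents can reuse it.
        self._store[(model, prefix_hash)] = kv_cache

    def get(self, model: str, prefix_hash: str) -> Any | None:
        # On a hit, the GPU skips re-prefilling the shared prefix entirely.
        return self._store.get((model, prefix_hash))

store = ContextMemoryStore()
store.put("rubin-llm", "prefix-abc123", kv_cache={"layers": "..."})
if store.get("rubin-llm", "prefix-abc123") is not None:
    print("KV cache hit: reuse stored context instead of recomputing the prefix")
```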

All of this is bound together by Spectrum‑6 SPX networking racks, which form the POD’s nervous system. These racks carry the east‑west and north‑south traffic for an AI factory, using either Spectrum‑X Ethernet or Quantum‑X800 InfiniBand switches. The flagship Spectrum‑6 switch offers 102.4Tb/s of bandwidth with 512 lanes and 200Gb/s co‑packaged optics, swapping out pluggable optics for silicon photonics to cut power and latency while keeping effective bandwidth near theoretical limits. In practice, that means GPU, CPU, storage, and LPX racks can be kept in lockstep, even as workloads churn between training, inference, RL, and agentic flows.
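
The headline switch number is simply lanes times per-lane speed, which a one-line check with the figures above confirms:

```python
# Sanity check on the Spectrum-6 switch figure quoted above.
lanes = 512
gbps_per_lane = 200                      # 200 Gb/s co-packaged optics per lane
total_tbps = lanes * gbps_per_lane / 1_000
print(f"{lanes} lanes x {gbps_per_lane} Gb/s = {total_tbps} Tb/s")   # -> 102.4 Tb/s
```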

What makes Vera Rubin POD particularly interesting is the third‑generation MGX rack under everything. Instead of treating the rack as an afterthought, NVIDIA has used MGX as the foundation for resiliency, serviceability, and energy efficiency. MGX racks use a single‑wide 19-inch design emphasizing PCB‑based connections and a modular copper “spine” backplane that can be configured for NVLink, Spectrum‑X Ethernet, or direct chip‑to‑chip LPU links. The result is a cable‑free and hose‑free tray layout, fewer moving parts, and a spine that can hold thousands of cables in pre‑integrated cartridges. For operators, this translates to simpler installs, easier logistics, and fewer nightmare maintenance windows when a cable goes bad.

Energy management is arguably where MGX gets most aggressive. At the rack level, MGX introduces dynamic power steering, which can move power budget in real time between CPUs, GPUs, and NVLink switch trays to keep each component in its most efficient operating band. NVIDIA also layers in rack‑level energy storage using capacitors, so sudden load spikes are absorbed locally while the grid sees a much smoother profile. Vera Rubin NVL72 pushes this further with “Intelligent Power Smoothing,” offering about six times more stored energy per GPU than the previous generation and a closed‑loop feedback system where GPUs track capacitor charge levels to better flatten demand. NVIDIA claims this can reduce peak current demands by up to 25% and sidestep the need for massive battery stacks just to ride out training spikes.
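
Conceptually, the closed loop works like a buffer between bursty GPUs and a steady grid feed. The toy simulation below illustrates that idea with made-up numbers and an invented control rule; it is not NVIDIA's algorithm:

```python
# Toy illustration of the closed-loop power-smoothing idea described above:
# a local capacitor bank absorbs load spikes while the grid sees a flat draw,
# and GPUs back off slightly when the capacitors run low. All numbers and the
# control law are invented for illustration.

GRID_LIMIT_KW = 120.0      # steady power the facility feed supplies to the rack
CAP_CAPACITY_KJ = 600.0    # hypothetical energy stored in the rack's capacitors

charge_kj = CAP_CAPACITY_KJ
demand_kw = [100, 100, 160, 170, 150, 90, 80, 100]   # bursty training load, 1 s steps

for t, want_kw in enumerate(demand_kw):
    # GPUs track capacitor charge and shed a little load when the bank runs low.
    allowed_kw = want_kw if charge_kj > 0.2 * CAP_CAPACITY_KJ else min(want_kw, GRID_LIMIT_KW)
    surplus_kw = GRID_LIMIT_KW - allowed_kw          # positive recharges caps, negative drains them
    charge_kj = min(CAP_CAPACITY_KJ, max(0.0, charge_kj + surplus_kw * 1.0))
    print(f"t={t}s load={allowed_kw:5.1f} kW  grid={GRID_LIMIT_KW:.0f} kW  caps={charge_kj:5.1f} kJ")
```

The key point the printout shows is that the grid-side draw stays constant while the rack-side load swings, which is exactly the flattening effect NVIDIA is claiming.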

Zooming out to the facility view, NVIDIA is encouraging operators to move away from static Max‑P provisioning—where every rack is wired as if it will always run at full blast. Instead, MGX supports a dynamic “Max‑Q” mode where power budgets are tuned to real workloads. Because AI factories run mixed jobs with different power profiles, this dynamic provisioning can reclaim stranded power and, according to NVIDIA, unlock up to 30% more GPUs in the same power envelope when combined with 45°C liquid cooling. The cooling approach matters here: by designing MGX racks to run with 45°C warm‑water inlets, many data centers can skip energy‑hungry chillers entirely and rely on ambient air plus dry coolers. NVIDIA argues that the power savings from that shift are big enough to fund roughly 10% more Vera Rubin NVL72 racks in the same data center power budget.
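
As a rough illustration of what that reclaimed headroom means, here is a back-of-envelope using the article's 30% figure and placeholder facility numbers of our own:

```python
# Back-of-envelope on reclaimed stranded power, using the "up to 30% more GPUs
# in the same power envelope" claim above. Baseline numbers are placeholders,
# not real facility data.

facility_power_mw = 50          # hypothetical AI-factory power budget
gpus_at_max_p = 20_000          # GPUs provisioned as if every rack always runs flat out

gpus_at_max_q = int(gpus_at_max_p * 1.30)   # dynamic Max-Q provisioning + 45°C liquid cooling
print(f"Same {facility_power_mw} MW envelope: {gpus_at_max_p} GPUs (Max-P) -> {gpus_at_max_q} GPUs (Max-Q)")
```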

The MGX platform is also being pushed as an open standard. NVIDIA’s prior GB200 NVL72 rack and tray designs were contributed to the Open Compute Project, and third‑gen MGX builds on that same architecture. More than 80 partners now build MGX‑based systems, meaning everything from racks and chassis to cable cartridges, liquid manifolds, busbars, and side rails can be sourced from a mature supply chain rather than custom one‑offs. That ecosystem is what allows NVIDIA to say that Vera Rubin POD is not just a lab toy but something that can be manufactured and deployed at scale across cloud regions and sovereign AI facilities.

For NVIDIA, the Vera Rubin POD isn’t just hardware; it’s part of a larger “AI factory” story. The Vera Rubin DSX AI factory reference design is the playbook: a blueprint for co‑designed infrastructure that ties together chips, systems, software libraries, and facility controls, all tuned around maximizing tokens per watt and time to first production. On top of this, NVIDIA is pushing Omniverse DSX—a digital‑twin blueprint that lets operators simulate an entire AI factory before they build it, including power, cooling, layout, and token throughput, then continuously optimize once it’s live.

Put differently, Vera Rubin POD treats the data center as a product. Instead of buying GPUs, network cards, and storage separately, customers buy a design for an AI factory where everything—from the 45°C water loops and capacitor banks to the NVLink topology and KV cache placement—has already been co‑tuned. Racks no longer just house servers; they are carefully engineered building blocks that can be stamped out by OEMs worldwide and slotted together into gigawatt‑scale AI factories.

For developers and AI companies, the pitch is straightforward: if you’re building long‑context, multi‑agent, tool‑using systems, Rubin wants to be the fastest, cheapest way to turn megawatts into tokens. For operators, the story is about predictability: deterministic latency from NVLink and Spectrum‑X, smooth grid behavior through power smoothing, and an ecosystem of partners that have already debugged the messy mechanical and cooling details. And for NVIDIA, Vera Rubin POD is a way to cement itself not just as a chip vendor, but as the reference architecture for how AI infrastructure itself should look in the agentic era.

