NVIDIA’s Nemotron 3 Ultra targets faster, cheaper long-running agents

NVIDIA’s new Nemotron 3 Ultra isn’t just another big language model announcement; it’s NVIDIA stepping squarely into the frontier-model arena with an open-weight system built from the ground up for long-running AI agents rather than chatbots. It’s a 550-billion-parameter Mixture-of-Experts model (with 55 billion parameters active for any given token) that NVIDIA is positioning as an open, high-speed reasoning engine for tool-heavy workflows, coding copilots, research agents, and complex enterprise automations.

What makes Nemotron 3 Ultra interesting is not only its size, but what it represents: a GPU company that has spent years selling picks and shovels to the AI gold rush is now shipping its own frontier-class open model to run on that hardware, and signaling that “agentic” workloads are the next big battleground.

Nemotron 3 Ultra: NVIDIA’s open frontier move

Nemotron 3 Ultra sits at the top of NVIDIA’s Nemotron 3 family of open models, above earlier releases like Nemotron 3 Super, which itself was a 120B-parameter open MoE model for multi-agent systems. Ultra is the flagship: a “frontier-scale” model with 550B total parameters, 55B active, built around a hybrid architecture that mixes Mixture-of-Experts with Mamba and Transformer layers.

NVIDIA describes it as a general-purpose reasoning and chat model optimized specifically for demanding agent workloads: multi-step planning, tool calling, code and math reasoning, long-context document analysis, and orchestration across many sub-tasks. In other words, this isn’t just about answering a single prompt; it is meant for AI systems that think, plan, and act over hundreds of turns and tool calls.

The model ships with fully open weights as part of an “open ecosystem” push: NVIDIA is releasing not just weights but also training data recipes and fine-tuning workflows under an open license via the Linux Foundation, which gives enterprises and researchers a relatively permissive base to customize for their own stacks. For developers in the US and beyond who want top-tier intelligence without going fully proprietary, Nemotron 3 Ultra is clearly targeting that “open frontier” niche.

Built for long-running agents, not one-shot chats

If you look at how NVIDIA frames Nemotron 3 Ultra, the word that keeps appearing is “agent.” Traditional chatbots care mostly about single interactions: a user prompt, a response, maybe a few turns. Long-running agents are different. They need to hold on to context across many steps, call tools and APIs, write and debug code, query databases, read large docs, and then stitch everything together into a coherent plan.
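To make the distinction concrete, here is a minimal sketch of the plan, act, observe loop that a long-running agent repeats. Everything in it is hypothetical: `AgentState`, `run_agent`, `scripted_model`, and the `word_count` tool are illustrative names, and the scripted function merely stands in for a real model call. It is not NVIDIA's agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Context the agent carries from one step to the next."""
    goal: str
    history: list = field(default_factory=list)


def run_agent(goal, model_step, tools, max_steps=10):
    """Ask the model for an action, run the tool, record the result, repeat."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action, argument = model_step(state)
        if action == "finish":
            return argument
        result = tools[action](argument)
        # The growing history is what a long context window has to hold.
        state.history.append(f"{action}({argument!r}) -> {result}")
    raise RuntimeError("agent hit max_steps without finishing")


def scripted_model(state):
    """Hypothetical stand-in for an LLM: one tool call, then finish."""
    if not state.history:
        return "word_count", "plan act observe repeat"
    return "finish", f"done after {len(state.history)} tool call(s): {state.history[-1]}"


TOOLS = {"word_count": lambda text: len(text.split())}

answer = run_agent("count the words", scripted_model, TOOLS)
print(answer)
```

In a real deployment the history would hold hundreds of such entries, which is where long-context handling starts to matter.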

Nemotron 3 Ultra is tuned precisely for that kind of workflow. NVIDIA says the model is optimized for long-context reasoning (up to around 1 million tokens), which means an agent can ingest huge codebases, multi-hundred-page PDFs, or extended conversation histories and still reason effectively. The design is also geared around tool-heavy behavior: coding agents, deep research assistants, enterprise workflow orchestrators, and EDA (electronic design automation) scenarios where agents need to reason stably across many steps without drifting or forgetting the bigger picture.

Under the hood, the model uses a hybrid Latent Mixture-of-Experts architecture with interleaved Mamba-2, MoE, and some Attention layers. That combination is meant to give you the best of three worlds: MoE for scale and efficiency, Mamba for long-sequence handling, and attention where it still matters most. NVIDIA emphasizes that only a subset of experts is active for each token (55B active parameters out of 550B, or about 10 percent), which is how it keeps inference costs manageable while still having a massive capacity to draw on when needed.
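The "only a subset of experts runs per token" idea can be illustrated with generic top-k routing, the standard mechanism in Mixture-of-Experts layers. The sketch below is a toy under stated assumptions: the expert count, the value of `k`, the router logits, and the scalar "experts" are all made up, and it does not reflect Nemotron 3 Ultra's actual router or its latent MoE design.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_layer(x, experts, router_logits, k):
    """Mix the outputs of the selected experts; the others are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy setup (hypothetical): 8 experts, 2 active per token.
# Expert i simply multiplies its input by i + 1.
experts = [lambda x, scale=i + 1: scale * x for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]

routes = top_k_route(logits, k=2)
output = moe_layer(1.0, experts, logits, k=2)
print(routes, output)
```

Here only experts 1 and 4 do any work for this token; the other six contribute capacity to the model without adding compute to this particular forward pass.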

From chips to full AI stacks

For years, NVIDIA has been the default choice for AI hardware in data centers, with GPUs like H100 and the newer Blackwell parts powering most leading models. Nemotron 3 Ultra shows that NVIDIA doesn’t just want to sell hardware; it wants to define the reference models that run on it, too. The model is tuned for NVIDIA’s NVFP4 precision format, which packs weights into a 4-bit floating point representation that can cut memory usage and improve throughput on Blackwell and Hopper GPUs.

NVIDIA claims Nemotron 3 Ultra can deliver roughly 5x higher throughput than comparable open frontier models, while lowering the cost of complex agent workloads by up to about 30 percent in some setups. That kind of performance-per-dollar story matters if you’re building agents that may run for hours, bouncing among tools, APIs, and internal services. It’s also a pointed message to cloud providers and AI startups: if you want the best performance on NVIDIA hardware, maybe NVIDIA’s own open model is the most optimized starting point.
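To see why throughput feeds so directly into the cost of long-running agents, here is a back-of-the-envelope sketch. The GPU price, GPU count, and token rates are hypothetical placeholders rather than NVIDIA figures, and the function name is illustrative.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second):
    """Hardware cost of generating one million output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical numbers: same 8-GPU node, two different serving throughputs.
baseline = cost_per_million_tokens(gpu_hourly_usd=4.0, num_gpus=8, tokens_per_second=100)
faster = cost_per_million_tokens(gpu_hourly_usd=4.0, num_gpus=8, tokens_per_second=500)

print(f"baseline: ${baseline:.2f} per 1M tokens")
print(f"faster:   ${faster:.2f} per 1M tokens")
```

On fixed hardware, per-token cost scales inversely with throughput. Real agent workloads also spend time on prompt processing, tool latency, and idle waits, which this sketch ignores, so headline throughput and end-to-end cost savings should not be expected to move in lockstep.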

Architecture, numbers, and what they mean in practice

The headline specs are straightforward: Nemotron 3 Ultra is a 550B-parameter model with 55B active parameters, a hybrid LatentMoE + Mamba + Transformer architecture, a 1M-token context window, text-in/text-out operation, and support for English and other major global languages. It’s designed as a reasoning-first model, with a “think-then-answer” style where it generates an internal reasoning trace before producing the final user-visible response, similar to recent “chain-of-thought” oriented systems.
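For application code, a think-then-answer model usually means separating the reasoning trace from the reply before showing anything to the user. The sketch below assumes the trace is wrapped in `<think>...</think>` tags. That delimiter is an assumption borrowed from other reasoning models, not something confirmed for Nemotron 3 Ultra, and `split_reasoning` is a hypothetical helper.

```python
import re

# Assumed delimiter; check the model's chat template for the real one.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_output):
    """Return (reasoning, answer) from a raw completion string."""
    match = THINK_BLOCK.search(raw_output)
    if match is None:
        # No trace found: treat the whole completion as the answer.
        return "", raw_output.strip()
    reasoning = match.group(1).strip()
    answer = (raw_output[:match.start()] + raw_output[match.end():]).strip()
    return reasoning, answer


raw = "<think>55 of 550 is one tenth.</think>About 10% of parameters are active per token."
reasoning, answer = split_reasoning(raw)
print(answer)
```

Keeping the two parts separate also lets an agent framework log the trace for debugging while sending only the answer downstream.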

Benchmarks from Artificial Analysis suggest that Nemotron 3 Ultra is currently the most capable US open-weight model on their Intelligence Index, scoring 47.7, ahead of models like Gemma 4 31B, Nemotron 3 Super, and gpt-oss-120B, though still behind some Chinese open-weight efforts like Kimi K2.6. On their Agentic Index, Nemotron 3 Ultra also leads among open models, meaning its performance on agent-specific tasks, multi-step workflows, and instruction-following is especially strong.

At the same time, it’s not just about intelligence; speed is a core selling point. On BlackBox AI ahead of release, Nemotron 3 Ultra was measured at over 400 output tokens per second, which is impressive given that it’s more than 4x larger than some of the competitors it outpaces in throughput. That combination of near-frontier intelligence and high throughput is what puts it on the so-called Pareto frontier for speed versus intelligence: among the models compared, none is both faster and smarter, so getting significantly more intelligence means paying a clear speed penalty, and vice versa.
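For readers unfamiliar with the term, a model sits on the Pareto frontier if no other model is at least as good on both axes and strictly better on one. The sketch below computes that set for a handful of placeholder entries; the names and numbers are invented for illustration and are not benchmark results.

```python
def pareto_frontier(models):
    """Return names of models that no other model dominates on speed and score."""
    frontier = []
    for name, speed, score in models:
        dominated = any(
            other_speed >= speed and other_score >= score
            and (other_speed > speed or other_score > score)
            for _, other_speed, other_score in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (name, output tokens per second, intelligence score); all values hypothetical.
candidates = [
    ("model_a", 400, 47),
    ("model_b", 150, 52),
    ("model_c", 300, 40),  # slower and weaker than model_a, so dominated
    ("model_d", 500, 30),
]

frontier = pareto_frontier(candidates)
print(frontier)  # ['model_a', 'model_b', 'model_d']
```

Note that being on the frontier does not mean being best at everything: `model_b` is smarter and `model_d` is faster, but each pays for it on the other axis.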

How it compares to other open models

Nemotron 3 Ultra doesn’t exist in a vacuum. It’s landing in an ecosystem that already has serious open contenders from Google (Gemma 4), other Nemotron variants, and large community-driven efforts. On Artificial Analysis’s rankings, it leads other US open-weight models in overall intelligence, but there are tradeoffs.

Here is a snapshot comparison based on currently available public data:

Model	Parameters (total/active)	Context window	Strengths (high level)
Nemotron 3 Ultra	550B / 55B	Up to 1M tokens	Agentic reasoning, speed, long-running workflows
Nemotron 3 Super	120B / 12B	Up to 1M tokens	Multi-agent, efficient voice and conversation agents
Gemma 4 31B	~31B	Long, but smaller than 1M (varies by deployment)	Strong coding in its class, competitive reasoning
gpt-oss-120B	~120B	Long-context (varies)	Open frontier-style model, good all-rounder

On coding benchmarks, for example, Gemma 4 31B apparently scores a bit higher than Nemotron 3 Ultra on the Artificial Analysis Coding Index, which suggests that if your top priority is pure code-completion quality in that test suite, Gemma may still be slightly ahead. But on broader reasoning and agent tasks, Nemotron 3 Ultra clearly leads among US open-weight releases at the moment, especially when you factor in speed.

Compared to Nemotron 3 Super, Ultra is a big jump up in scale and intelligence. Super, with its 120B parameters and 12B active configuration, was already pitched for multi-agent applications and voice agents, with strong performance on long conversations and reasoning tasks at lower cost. Ultra more or less takes that concept to the frontier level: more capable reasoning and higher overall performance at the same 1M-token context window, albeit on beefier hardware.

Open weights, open recipes, and the Linux Foundation angle

One of the more surprising dimensions of this launch is how open NVIDIA is going with it. The company is releasing Nemotron 3 Ultra under an open license via the Linux Foundation, and the package is not limited to just weights. NVIDIA is also making training data recipes, fine-tuning workflows, and related tooling available, effectively offering a blueprint for how the model was built and how you can adapt it.

For enterprises in regulated sectors or those with heavy compliance requirements, that matters. Being able to run an open frontier-class model in your own VPC, on-prem cluster, or custom infrastructure, with the option to fine-tune on private data without sending anything to a third-party API, makes the model much more appealing. For the broader open-source and research community, it’s also a data point in the “can big vendors still meaningfully support open AI?” debate.

This approach lines up with the earlier Nemotron 3 Super launch, where NVIDIA emphasized high-efficiency, open MoE models that developers could run using vLLM, Hugging Face, and other standard stacks. With Ultra, NVIDIA is amplifying that story: their hardware, their precision formats, their inference libraries, their open model. It’s a vertically integrated pitch, but with enough openness to keep developers interested.

Performance, cost, and the NVFP4 story

A lot of the efficiency narrative around Nemotron 3 Ultra comes down to NVFP4, NVIDIA’s 4-bit floating point format that’s designed to offer a sweet spot between compression and numerical stability. By storing weights in NVFP4, NVIDIA can cram more of the model into GPU memory while still keeping quality high, which is crucial for a 550B-parameter system.
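A rough calculation shows why 4-bit storage matters at this scale. The sketch assumes NVFP4 stores 4-bit values plus one 8-bit scale factor per block of 16 weights, for about 4.5 bits per weight. That layout is an assumption here, and the estimate ignores the KV cache, activations, and any layers kept at higher precision.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 550e9  # total parameter count from the article

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)

# Assumed block-scaled layout: 4-bit value + 8-bit scale shared by 16 weights.
NVFP4_BITS = 4 + 8 / 16
nvfp4_gb = weight_memory_gb(TOTAL_PARAMS, NVFP4_BITS)

print(f"BF16 weights:  {bf16_gb:.0f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.0f} GB")
print(f"reduction:     {bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from roughly 1.1 TB to a little over 300 GB, which is the difference between spreading a model across many GPUs and fitting it on a single multi-GPU node.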

In practice, NVIDIA and early partners report that Nemotron 3 Ultra can hit high throughput – on the order of hundreds of tokens per second – and that it outperforms previous Nemotron models and many competitors in throughput while also delivering higher quality. NVIDIA claims up to around 5x higher throughput versus comparable open frontier models and as much as 30 percent lower cost for certain agentic workloads, assuming you’re running on the right NVIDIA hardware.

That does introduce a subtle lock-in question: yes, the model is open, but it’s tuned aggressively for NVIDIA GPUs. Partners like SGLang and Miles have already announced “day 0” support and show Nemotron 3 Ultra running efficiently on Blackwell and Hopper hardware with NVFP4 and BF16. If you’re already in the NVIDIA ecosystem, it’s very attractive. If you’re betting on alternative accelerators, you may not see the same advantages.

What this means for developers and enterprises

For developers, especially those in the US and Europe who want a powerful open-weight model for agents, Nemotron 3 Ultra is likely to become a default candidate alongside the usual closed-API giants. It gives you:

A frontier-scale reasoning engine with strong performance on multi-step and agentic tasks.
A 1M-token context window that can handle massive documents, logs, or codebases.
An open-weight, open-recipe release that you can fine-tune and deploy under your control.
Optimizations for NVIDIA hardware that can significantly reduce latency and cost if you already run on that stack.

For enterprises, the appeal is slightly different. Nemotron 3 Ultra can be slotted into existing RAG pipelines, LLM gateways, and agent orchestrators as the “brain” behind everything from helpdesk automation to financial analysis and EDA workflows. The open licensing and Linux Foundation alignment make it easier to pitch to legal teams that are wary of black-box proprietary APIs. And for companies that have already invested heavily in on-prem GPU clusters, Nemotron 3 Ultra is a way to extract more value from that hardware without being locked into a single LLM vendor.

Where this leaves the AI race

Zooming out, Nemotron 3 Ultra is another example of how the AI race is no longer just about who has the biggest proprietary model. We’re now seeing a clear open frontier tier emerge: extremely capable open-weight models, backed by major players, that try to balance quality, speed, and deployment flexibility. NVIDIA now has one of the strongest entries in that tier, at least in the US context.

It also underscores the shift from single-shot chat to agentic systems as the next phase of LLM adoption. NVIDIA is not just saying “this model can chat”; it’s saying “this model is the orchestration engine for fleets of agents that can plan, reason, and act over long stretches of time.” That framing aligns with how many dev teams are now thinking: tools, APIs, workflows, and “do something useful,” not just “answer this question.”


GadgetBond
