NVIDIA just quietly flipped a pretty big switch: MiniMax M2.7, the latest in MiniMax’s agentic model lineup, is now live with open weights through NVIDIA’s ecosystem and the broader open-source inference stack. It’s the kind of move that doesn’t just add yet another LLM to the pile—it gives developers a serious, production-grade agent and coding workhorse they can actually run, tune, and scale on their own terms.
At its core, MiniMax M2.7 is a sparse mixture-of-experts (MoE) model tuned for long-running, tool-using agents rather than just chatty assistants. You’re looking at 230 billion total parameters, but only about 10 billion “active” per token thanks to its MoE routing design, so you get big-model capacity without paying big-model inference costs every time you send a prompt. The architecture leans on multi-head causal self-attention, RoPE positional embeddings, and Query-Key RMSNorm, plus a top-k routing scheme that activates just 8 of 256 experts per token, leaving roughly 4.3% of the parameters (about 10B of 230B) doing the work on any given pass. In plain language: the model stays smart, but it’s picky about which parts of its brain it uses at any moment, which is why it can scale.
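To make the routing idea concrete, here is a toy top-k router in pure Python. It is a sketch of the general MoE technique, not MiniMax's actual implementation; the numbers simply mirror the figures above.

```python
import math

def top_k_route(gate_logits, k=8):
    """Toy top-k MoE router: keep the k highest-scoring experts and
    softmax-normalize their gate weights (illustrative only)."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

# Back-of-envelope numbers from the article: 8 of 256 experts fire per
# token, and ~10B of 230B parameters are active on any forward pass.
expert_fraction = 8 / 256    # ~3.1% of experts per token
param_fraction = 10 / 230    # ~4.3% of parameters per token
```

The two fractions differ because experts are not all the same size as the shared (always-on) layers; the ~4.3% parameter figure is the one that tracks inference cost.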
There’s also the context window, which is frankly huge: up to 200K tokens. That’s long enough to feed entire codebases, multi-step research traces, or dense technical documents into one session and still have room for the model to reason over them. For anyone designing autonomous agents, research pipelines, or AI dev tools, that kind of context isn’t a “nice to have”—it’s the difference between a toy assistant and something that can actually keep track of what it’s doing across long workflows.
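For a rough sense of what 200K tokens buys you, here is a back-of-envelope helper. The ~4 characters-per-token ratio is a common English-text heuristic, not MiniMax's tokenizer, and the 25% output reserve is an arbitrary assumption for illustration:

```python
def fits_in_context(total_chars, context_tokens=200_000,
                    chars_per_token=4, reserve=0.25):
    """Estimate whether a blob of text fits in the context window while
    leaving `reserve` of the budget free for the model's own reasoning
    and output. The chars-per-token ratio is a heuristic, not an exact
    tokenizer count."""
    est_tokens = total_chars / chars_per_token
    return est_tokens <= context_tokens * (1 - reserve)

# A ~400KB codebase (~100K tokens) fits with room to spare; an ~800KB
# one (~200K tokens) does not once you reserve output budget.
```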
MiniMax’s own positioning of M2.7 makes the intent clear: this isn’t just another generalist model—it’s explicitly marketed as capable of building and driving complex agent harnesses, including multistep productivity workflows, AI coding tools, and interactive environments. On internal and public benchmarks, they’re not shy about where they think it lands: MiniMax reports strong performance on complex skills over long prompts, a 97% “skill adherence” rate on tasks with more than 2,000 tokens, and substantial gains over M2.5 in real agent frameworks like OpenClaw. They also highlight that M2.7 approaches Anthropic’s Claude Sonnet-class performance on MMClaw-style evaluations, while remaining fully open-weights.
On the hard-numbers side, MiniMax cites serious coding and reasoning scores: on SWE-Pro, M2.7 lands just below frontier closed models such as Claude Opus, and it extends that capability into full project delivery (VIBE-Pro) and terminal-style system understanding (TerminalBench 2). For devs, that translates to a single model that can not only write functions but own end-to-end tasks: structuring a repo, wiring services, and iterating under feedback.
NVIDIA’s angle here is just as important as the model specs. With M2.7’s open weights now available through NVIDIA, the company is turning its AI stack into a kind of “reference highway” for serious open models. The model is integrated into multiple layers of that stack:
First is NVIDIA NemoClaw, an open-source reference stack focused on “always‑on” agents. NemoClaw sits on top of NVIDIA OpenShell, a secure runtime for autonomous agents that can call tools, hit endpoints, and run open models like M2.7 in a guarded environment. From a developer’s perspective, this matters because long‑running agent systems are a nightmare to wire up safely—NemoClaw gives you a one-command setup that provisions OpenClaw + OpenShell on NVIDIA’s Brev cloud GPU platform, so you can go from “reading about M2.7” to “actually running an agent” in minutes instead of days.
Then there’s the inference stack. NVIDIA has been systematically optimizing open MoE models, and M2.7 benefits from that work in vLLM and SGLang. The company and the open-source community collaborated to add high-performance kernels that specifically target MoE pain points. Two stand out:
- A fused QK RMSNorm kernel, which collapses the separate query and key normalization steps into a single pass; this cuts kernel launch overhead and memory operations and improves overall inference throughput.
- An FP8 MoE kernel built on TensorRT-LLM, tuned for MoE routing and designed to squeeze more performance out of NVIDIA GPUs without tanking quality.
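To see why fusing the Q and K normalizations helps, here is the underlying math in a toy pure-Python form; a real fused kernel does this on GPU tensors in one launch instead of two, which is where the savings come from. This illustrates RMSNorm itself, not NVIDIA's kernel code:

```python
import math

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then apply a
    learned per-element scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def fused_qk_rmsnorm(q, k, q_weight, k_weight):
    """Conceptual 'fusion': one call covers both Q and K, the way a
    fused kernel folds two launches (and their memory round-trips)
    into one."""
    return rmsnorm(q, q_weight), rmsnorm(k, k_weight)
```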
On NVIDIA Blackwell Ultra GPUs, those optimizations are not just theoretical. NVIDIA reports that on a standard 1K/1K input/output sequence dataset, they saw up to 2.5× throughput gains with vLLM and up to 2.7× with SGLang over a single month of tuning. That kind of rapid iteration means the “stack” around M2.7 is evolving almost as fast as the model lineup itself.
If you’re actually deploying this thing, NVIDIA has made sure the serving story is straightforward. For vLLM, you can launch M2.7 with a standard CLI: set the model path to the MiniMax M2.7 weights, configure tensor parallelism (for example, --tensor-parallel-size 4), and enable expert parallelism alongside tool-call and reasoning parsers tuned specifically for MiniMax’s format. For SGLang, there’s a similar one-liner: you point at MiniMaxAI/MiniMax-M2.7, wire up tensor parallelism, memory fraction, and batch size, enable FP8 quantization, and specify the MoE backend (flashinfer_trtllm_routed) plus the FP8 GEMM backend. In practice, this gives teams a choice: vLLM’s more generalist serving framework, or SGLang’s agent-forward runtime that’s already popular in coding- and tools-heavy setups.
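Whichever server you pick, both vLLM and SGLang expose an OpenAI-compatible HTTP API, so client code looks the same either way. A minimal standard-library sketch (the localhost URL is a placeholder for wherever your server is running):

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=512):
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url, model, prompt):
    """POST the payload to an OpenAI-compatible server (vLLM or SGLang)
    and return the first completion's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server on your machine):
# chat("http://localhost:8000", "MiniMaxAI/MiniMax-M2.7",
#      "Summarize this repo's build process.")
```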
On the access side, there are three main doors depending on how deep you want to go:
- build.nvidia.com: NVIDIA is exposing MiniMax M2.7 as a free, GPU-accelerated endpoint through its Build portal. You can test prompts right in the browser, plug in your own data, and see how the model performs before committing to any infra decisions.
- NVIDIA NIM microservices: when you’re ready to go beyond tinkering, the same model is packaged as an optimized NIM container—a production-ready inference microservice you can deploy on-prem, in the cloud, or in hybrid setups. The MiniMax M2.7 NIM container is tuned for NVIDIA GPUs and slots into the broader NIM ecosystem that now spans 100+ models with free-tier inference.
- Hugging Face + direct weights: if you want full control, the open weights are available on Hugging Face via MiniMax’s official org, with support for frameworks like vLLM, SGLang, and NVIDIA NeMo. That’s the route you take if you’re building your own stack, doing custom sharding strategies, or experimenting with bespoke inference kernels.
For teams that don’t just want to run the base model, NVIDIA is also making sure post-training is a first-class path. Through the NVIDIA NeMo framework, you can use the NeMo AutoModel library to fine-tune MiniMax M2.7 with officially documented recipes and example configs. There are sample fine-tune configs for specific tasks (for example, HellaSwag-style reasoning), plus references to the latest checkpoints on Hugging Face, so you’re not starting from scratch.
If you’re chasing alignment or reward-shaped behavior, the NeMo RL library supports reinforcement learning on M2.7, including public recipes for different sequence lengths (like 8K and 16K sequence RL setups) and shared accuracy validation curves so you can sanity-check your runs. From a practical standpoint, that means you can do things like: teach M2.7 to be extra strict about tool use, optimize it for code reliability over raw creativity, or tune it for a specific internal evaluation suite.
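As a flavor of what "extra strict about tool use" means in practice, here is a toy rule-based reward function of the kind you might plug into an RL loop. This is purely illustrative; real NeMo RL recipes use richer reward signals, and the JSON tool-call shape here is an assumption, not MiniMax's actual format:

```python
import json

def tool_call_reward(completion, allowed_tools):
    """Toy reward for strict tool use: +1.0 for a well-formed JSON call
    to a known tool with a dict of arguments, 0.0 for anything else."""
    try:
        call = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if call.get("tool") in allowed_tools and isinstance(call.get("arguments"), dict):
        return 1.0
    return 0.0
```

A reward like this pushes the policy toward emitting only parseable, whitelisted tool calls, which is the behavior long-running agent harnesses depend on.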
Zooming out, the timing and positioning of MiniMax M2.7 on NVIDIA’s stack say a lot about where the ecosystem is heading. Over the past year, NVIDIA has aggressively expanded free and open-weight model access via NIM and Build, covering everything from DeepSeek to GLM 4.7 and earlier MiniMax M2.x releases. Now, with M2.7’s open weights and first-class support in vLLM, SGLang, NemoClaw, and NIM, there’s a clear pattern:
- Agentic workflows are becoming “default,” not niche. M2.7 is explicitly engineered and marketed for agents, coding, and long-running workflows, and NVIDIA is shipping the surrounding runtime (OpenShell, NemoClaw, OpenClaw) to make those workloads sane to operate.
- MoE is crossing from research into production. The throughput gains NVIDIA is showing on Blackwell Ultra with FP8 MoE and fused kernels make it plausible to run something with 230B parameters behind the scenes without needing hyperscaler-only budgets.
- Open weights are becoming a competitive edge, not just a community checkbox. MiniMax is openly comparing M2.7 against frontier‑class proprietary models in benchmarks and still choosing to release it with open weights, while NVIDIA makes it trivial to deploy on its own hardware. For enterprises worried about lock-in, that combination—strong performance plus self-hosting options—is starting to look very attractive.
If you’re a developer, researcher, or a company building around AI agents, devtools, or complex productivity workflows, the takeaway is pretty simple: MiniMax M2.7 is now something you can actually put to work. You can:
- Hit build.nvidia.com and try it in minutes via a browser endpoint.
- Spin it up with vLLM or SGLang and ride the latest MoE kernel optimizations.
- Wrap it in NemoClaw + OpenShell to experiment with always-on agents that can call tools and systems securely.
- Fine-tune or RL-train it with NeMo if you want a domain-tuned variant for your own stack.
And because the weights are open, you’re not just renting the model—you can take it with you, dissect it, and keep evolving it alongside your own systems.
Discover more from GadgetBond
