Google is turning the speed dial way up on Gemma 4. The company has released a new set of Multi-Token Prediction (MTP) “drafters” that sit alongside the main Gemma 4 models and can make them up to three times faster at inference, without sacrificing the quality of the answers you get back.
If you think about how most large language models work today, they’re basically very smart but slightly plodding typists. They generate one token at a time, in strict order, constantly shuttling billions of parameters in and out of GPU memory just to decide the next word. Surprisingly, that “one-token-at-a-time” loop is limited not by raw compute but by memory bandwidth, which is why even powerful GPUs can feel weirdly underused while your model generates at a modest pace. Google’s MTP drafters are designed to attack that bottleneck head-on.
The basic idea is speculative decoding, and it’s simpler than it sounds once you strip away the jargon. You pair a big, slow-but-smart “target” model like Gemma 4 31B with a smaller, faster “drafter” that tries to guess several future tokens in one go. While the main model is still doing its heavy computation, the drafter runs ahead and proposes a sequence of likely continuations, which the big model then checks in a single forward pass. If the target model agrees, it accepts the whole chunk — and even adds one more token of its own — so your app effectively gets multiple tokens in the time it used to take to generate just one.
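In rough Python pseudocode, the loop looks something like the sketch below. This is a minimal illustration of speculative decoding with greedy verification in general, not Google’s actual MTP implementation; the model objects and method names are placeholders.

```python
# Minimal sketch of one speculative decoding step (greedy verification).
# `target_model` and `draft_model` are stand-in objects, not real Gemma 4 APIs.

def speculative_step(target_model, draft_model, tokens, k=4):
    # 1. The small drafter guesses k tokens ahead, one cheap step at a time.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model.next_token(draft))
    proposed = draft[len(tokens):]

    # 2. The big target model scores every position of the drafted sequence
    #    in ONE forward pass, instead of k separate memory-bound decode steps.
    target_preds = target_model.predict_all(draft)  # predicted next token at each position

    # 3. Walk the proposals and accept them as long as the target agrees.
    accepted = []
    for i, tok in enumerate(proposed):
        if target_preds[len(tokens) + i - 1] == tok:
            accepted.append(tok)
        else:
            # First disagreement: keep the target's own token and stop here.
            accepted.append(target_preds[len(tokens) + i - 1])
            return tokens + accepted

    # 4. All k proposals accepted: the target's pass also yields one bonus token.
    accepted.append(target_preds[len(draft) - 1])
    return tokens + accepted
```

Because every accepted token is one the target model would have produced anyway, the output distribution stays the same; the drafter only changes how many tokens you get per expensive forward pass.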

This is where the claimed “up to 3x speedup” comes from. By offloading the easy parts of the job to a much smaller helper, the system uses idle compute that would otherwise just be waiting on memory, especially on consumer GPUs and laptops. Google says the MTP drafters are tuned so that you don’t trade speed for sloppiness: the larger Gemma 4 model still has veto power over every token, which means reasoning quality and accuracy stay effectively the same while latency drops.
Importantly, this is not just a lab demo. The MTP drafters cover the full Gemma 4 lineup: the 31B dense flagship, the 26B Mixture-of-Experts variant, and the smaller E2B and E4B models aimed at phones, tablets and other edge devices. That means the same speculative decoding trick works for everything from cloud-based coding agents to on-device chatbots and voice assistants that need to feel snappy on mobile hardware.
Under the hood, Google has done some clever engineering so these drafters aren’t just bolted on as an afterthought. The drafter models are small — think a handful of layers rather than a giant stack — and they reuse the target model’s internal activations and KV cache. That means they don’t waste time recomputing context the main model has already worked out, which is a big part of why the speedups are real and not just marketing. For the edge-focused E2B and E4B variants, Google has even added efficient clustering in the embedder step to speed up the final logits calculation, which becomes a major bottleneck on smaller devices.
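For intuition, here is a rough, hypothetical PyTorch sketch of the general pattern: a tiny set of prediction heads that read the target model’s final hidden state and guess the next few tokens, with no re-encoding of the prompt. The layer sizes, names, and structure are illustrative assumptions, not the actual Gemma 4 MTP drafter architecture.

```python
import torch
import torch.nn as nn

class TinyDrafterHead(nn.Module):
    """Illustrative multi-token drafter: a few small heads that reuse the
    target model's last hidden state instead of re-running the full context.
    Sizes and structure are assumptions, not the real Gemma 4 MTP design."""

    def __init__(self, hidden_size=4096, vocab_size=256_000, num_draft_tokens=4):
        super().__init__()
        self.num_draft_tokens = num_draft_tokens
        # One small projection per future position, instead of a full decoder stack.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, hidden_size),
                nn.GELU(),
                nn.Linear(hidden_size, vocab_size),
            )
            for _ in range(num_draft_tokens)
        ])

    def forward(self, last_hidden_state):
        # last_hidden_state: [batch, hidden_size], taken from the target model's
        # final layer at the current position -- no recomputation of the context.
        draft_logits = [head(last_hidden_state) for head in self.heads]
        # Greedy draft: one guessed token per future position, shape [batch, num_draft_tokens].
        return torch.stack([logits.argmax(dim=-1) for logits in draft_logits], dim=-1)
```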
If you’re a developer, the pitch is straightforward: faster responses, same brains. Less latency directly translates into smoother chat experiences, more natural-feeling voice interfaces, and more capable multi-step agents that can plan and react without constant awkward pauses. On the local side, Google specifically calls out running the 26B MoE and 31B dense models on personal machines and consumer GPUs with what they describe as “unprecedented” speed, so offline coding assistants and local agent workflows become much more realistic.
There’s also a throughput angle, not just latency. On GPUs like NVIDIA’s A100 and on Apple Silicon, processing multiple requests together at larger batch sizes lets speculative decoding really shine: Google reports speedups of around 2.2x in some local Apple Silicon scenarios when the batch size is bumped from 1 to the 4–8 range. That matters for anyone running shared backends or multi-user tools, where keeping the GPU busier without increasing costs per request is a big win.
From an ecosystem point of view, Google is clearly trying to make this as accessible as possible. The MTP drafters ship under the same Apache 2.0 open-source license as Gemma 4 itself, and the weights are already live on major hubs like Hugging Face and Kaggle as “assistant” models that correspond to each main checkpoint. On the software side, support is rolling out across the usual suspects: Transformers, vLLM, SGLang, MLX for Apple Silicon, LiteRT-LM for edge deployments, plus Ollama for developers who prefer a more turnkey local setup.
In practical terms, using MTP with Gemma 4 in popular frameworks is meant to feel more like flipping a switch than rewriting your stack. In Hugging Face Transformers, for example, you load the main Gemma 4 model as the target, then load the matching “-assistant” drafter and pass it as an assistant_model argument when you call generate. vLLM exposes it via a speculative decoding config flag, and some on-device LiteRT exports bake the MTP structures directly into the graph so that edge runtimes can exploit them with minimal custom code.
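As a rough sketch of what that looks like in Transformers, the snippet below passes the drafter via the `assistant_model` argument to `generate`; the checkpoint IDs are placeholders rather than confirmed Gemma 4 names.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint names -- substitute the real Gemma 4 IDs from Hugging Face.
target_id = "google/gemma-4-31b-it"
drafter_id = "google/gemma-4-31b-it-assistant"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(drafter_id, device_map="auto")

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# Passing the drafter as `assistant_model` switches generate() into assisted
# (speculative) decoding: the drafter proposes tokens, the target verifies them.
outputs = target.generate(**inputs, assistant_model=drafter, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the pairing works as described, the only change from a standard `generate()` call is that one extra argument.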
The timing of this release is also interesting when you zoom out a bit. Gemma 4 has only been around for a few weeks and has already crossed roughly 60 million downloads, which is a huge number for an open-weight model family that isn’t branded as the “main” Google assistant product. Shipping speculative decoding drafters this quickly suggests Google sees Gemma not just as a research artifact, but as a serious platform it wants people to build real products on — including those that need to run on personal hardware, not just in hyperscale data centers.
From a user’s perspective, you’re not going to see “MTP drafter engaged” on screen, but you will feel it in how the model behaves. Messages start streaming faster, long-form answers land sooner, and coding sessions with an AI pair programmer feel less like waiting for a remote server and more like interacting with a responsive local tool. On phones and tablets, the gains are even more visible because every millisecond saved is one less hit to the battery and one more reason to keep tasks on-device instead of round-tripping to the cloud.
There is also a broader trend here: speculative decoding is becoming a standard ingredient in modern LLM serving stacks, not a quirky research trick. We’ve already seen community-driven efforts like Medusa and Eagle pushing parallel token generation, and Google’s MTP implementation for Gemma 4 is another sign that “single-token, strictly autoregressive decoding” is slowly evolving into “smart, parallel-verified drafting” for any serious, latency-sensitive workload.
All told, Google’s MTP drafters for Gemma 4 are less about a flashy new model and more about making existing intelligence feel usable in real time. With open weights, broad framework support, and tangible speed gains on everything from workstation GPUs to mobile devices, this is the kind of under-the-hood upgrade that developers notice first — and end users quietly benefit from every time their AI stops feeling like it’s thinking in slow motion.