GadgetBond


Gemma 4 just got faster with new MTP draft models

Google is rolling out Multi-Token Prediction drafters for Gemma 4, pushing open models to respond up to three times faster without hurting output quality.

By Shubham Sawarkar, Editor-in-Chief
May 6, 2026, 9:00 AM EDT
Dark blue promotional graphic for Gemma 4 featuring glowing blue particle streaks radiating toward the center, with the text “Gemma 4” and subtitle “Multi-Token Prediction Drafters” displayed prominently.
Image: Google

Google is turning the speed dial way up on Gemma 4. The company has released a new set of Multi-Token Prediction (MTP) “drafters” that sit alongside the main Gemma 4 models and can make them up to three times faster at inference, without sacrificing the quality of the answers you get back.

If you think about how most large language models work today, they’re basically very smart but slightly plodding typists. They generate one token at a time, in strict order, constantly shuttling billions of parameters in and out of GPU memory just to decide the next word. That “one-token-at-a-time” loop is surprisingly constrained not by raw compute, but by memory bandwidth, which is why even powerful GPUs can feel weirdly underused while your model is generating at a modest pace. Google’s MTP drafters are designed to attack that bottleneck head-on.
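To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch. It assumes that generating one token requires reading every weight from memory once, and the parameter count, bytes per parameter, and bandwidth figure are illustrative assumptions rather than measurements from Google.

```python
def max_tokens_per_second(num_params: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Rough ceiling on single-stream decode speed.

    Assumes each generated token needs one full read of the weights,
    so speed is limited by how fast memory can deliver them.
    """
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative assumptions: a 31B-parameter model stored at 2 bytes per
# parameter, on hardware with 1 TB/s of memory bandwidth.
bound = max_tokens_per_second(31e9, 2.0, 1e12)
print(f"{bound:.1f} tokens/s upper bound")
```

Under these assumptions the ceiling depends only on model size and bandwidth, not on how many arithmetic units the chip has, which is why the compute side can sit partly idle during ordinary decoding.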

The basic idea is speculative decoding, and it’s simpler than it sounds once you strip away the jargon. You pair a big, slow-but-smart “target” model like Gemma 4 31B with a smaller, faster “drafter” that tries to guess several future tokens in one go. The drafter runs ahead and proposes a short sequence of likely continuations, which the big model then checks in a single forward pass. If the target model agrees, it accepts the whole chunk and even adds one more token of its own, so your app effectively gets multiple tokens in the time it used to take to generate just one. If it disagrees partway through, it keeps the tokens up to that point, substitutes its own choice at the first mismatch, and the drafter starts again from there.
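The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal greedy version, not Google's implementation: `target_next` and `draft_next` are hypothetical functions that return the next token for a context, and the verification loop simulates what a real target model would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], target_next: NextToken,
                     draft_next: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the prefix the target agrees with."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target overrules the drafter
            return accepted
        accepted.append(tok)
    # Every drafted token was accepted: the target adds one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def speculative_generate(prompt: List[int], target_next: NextToken,
                         draft_next: NextToken, k: int,
                         max_new_tokens: int) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verify passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, target_next, draft_next, k))
        passes += 1
    return out[: len(prompt) + max_new_tokens], passes


def greedy_generate(prompt: List[int], target_next: NextToken,
                    max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Toy models: the target counts upward modulo 10; the drafter makes the
# same prediction except that it guesses wrong right after a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


spec_out, passes = speculative_generate([0], toy_target, toy_draft, 3, 12)
greedy_out = greedy_generate([0], toy_target, 12)
print(spec_out == greedy_out, passes)
```

The two outputs match because the target has the final say on every position; the only thing that changes is how many target passes it takes to get there.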

Bar chart titled “Gemma 4 MTP Drafter Speed-ups” comparing token-per-second performance gains across Gemma 4 models and hardware platforms, including Samsung S26 mobile GPU, Pixel TPU, Apple M4, and NVIDIA A100, with speed improvements ranging from up to 1.5x to 3.1x.
Image: Google

This is where the claimed “up to 3x speedup” comes from. By offloading the easy parts of the job to a much smaller helper, the system uses idle compute that would otherwise just be waiting on memory, especially on consumer GPUs and laptops. Google says the MTP drafters are tuned so that you don’t trade speed for sloppiness: the larger Gemma 4 model still has veto power over every token, which means reasoning quality and accuracy stay effectively the same while latency drops.
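As a rough illustration of how a figure like this can arise, the sketch below uses a common simplification from the speculative decoding literature: each drafted token is accepted independently with the same probability. The acceptance rate and draft length here are made-up inputs, not values reported by Google, and the estimate ignores the drafter's own cost.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; the target always contributes one token itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Hypothetical inputs: 80% of drafted tokens accepted, 4 tokens per draft.
print(round(expected_tokens_per_pass(0.8, 4), 4))
```

With a drafter that is never right, the function returns 1.0, which is ordinary one-token-at-a-time decoding; the closer the drafter tracks the target, the closer the result gets to the draft length plus one.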

Importantly, this is not just a lab demo. The MTP drafters cover the full Gemma 4 lineup: the 31B dense flagship, the 26B Mixture-of-Experts variant, and the smaller E2B and E4B models aimed at phones, tablets and other edge devices. That means the same speculative decoding trick works for everything from cloud-based coding agents to on-device chatbots and voice assistants that need to feel snappy on mobile hardware.


Under the hood, Google has done some clever engineering so these drafters aren’t just bolted on as an afterthought. The drafter models are small (think a handful of layers rather than a giant stack), and they reuse the target model’s internal activations and KV cache. That means they don’t waste time recomputing context the main model has already worked out, which is a big part of why the speedups are real and not just marketing. For the edge-focused E2B and E4B variants, Google has also added efficient clustering in the embedder step to speed up the final logits calculation, which can become a major bottleneck on smaller devices.

If you’re a developer, the pitch is straightforward: faster responses, same brains. Less latency directly translates into smoother chat experiences, more natural-feeling voice interfaces, and more capable multi-step agents that can plan and react without constant awkward pauses. On the local side, Google specifically calls out running the 26B MoE and 31B dense models on personal machines and consumer GPUs with what they describe as “unprecedented” speed, so offline coding assistants and local agent workflows become much more realistic.

There’s also a throughput angle, not just latency. On GPUs like NVIDIA’s A100 or on Apple Silicon, processing multiple requests together with larger batch sizes lets speculative decoding really shine, with Google observing speedups of around 2.2x in some local Apple Silicon scenarios when you bump the batch from 1 to the 4–8 range. That matters for anyone running shared backends or multi-user tools, where keeping the GPU busier without increasing costs per request is a big win.

From an ecosystem point of view, Google is clearly trying to make this as accessible as possible. The MTP drafters ship under the same Apache 2.0 open-source license as Gemma 4 itself, and the weights are already live on major hubs like Hugging Face and Kaggle as “assistant” models that correspond to each main checkpoint. On the software side, support is rolling out across the usual suspects: Transformers, vLLM, SGLang, MLX for Apple Silicon, LiteRT-LM for edge deployments, plus Ollama for developers who prefer a more turnkey local setup.

In practical terms, using MTP with Gemma 4 in popular frameworks is meant to feel more like flipping a switch than rewriting your stack. In Hugging Face Transformers, for example, you load the main Gemma 4 model as the target, then load the matching “-assistant” drafter and pass it as an assistant_model argument when you call generate. vLLM exposes it via a speculative decoding config flag, and some on-device LiteRT exports bake the MTP structures directly into the graph so that edge runtimes can exploit them with minimal custom code.
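For the Transformers path described above, a minimal sketch might look like the following. The `assistant_model` argument to `generate` is Transformers' existing assisted-generation hook; everything else is an assumption, including the use of `AutoModelForCausalLM` for Gemma 4 checkpoints. The repository names are left as parameters because the exact IDs are not given here.

```python
def generate_with_drafter(target_id: str, drafter_id: str, prompt: str,
                          max_new_tokens: int = 64) -> str:
    """Generate text from a target model with a drafter assisting.

    `target_id` and `drafter_id` are placeholders for the main checkpoint
    and its matching "-assistant" drafter; look up the real names on the
    model hub. Loading via AutoModelForCausalLM is an assumption.
    """
    # Imported lazily so this sketch can be read and loaded without
    # the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    target = AutoModelForCausalLM.from_pretrained(target_id)
    drafter = AutoModelForCausalLM.from_pretrained(drafter_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = target.generate(
        **inputs,
        assistant_model=drafter,  # the drafter proposes, the target verifies
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling it requires the `transformers` library, a supported backend such as PyTorch, and enough memory for both checkpoints; check the model cards for the recommended loading code before relying on this shape.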

The timing of this release is also interesting when you zoom out a bit. Gemma 4 has only been around for a few weeks and has already crossed roughly 60 million downloads, which is a huge number for an open-weight model family that isn’t branded as the “main” Google assistant product. Shipping speculative decoding drafters this quickly suggests Google sees Gemma not just as a research artifact, but as a serious platform it wants people to build real products on — including those that need to run on personal hardware, not just in hyperscale data centers.

From a user’s perspective, you’re not going to see “MTP drafter engaged” on screen, but you will feel it in how the model behaves. Messages start streaming faster, long-form answers land sooner, and coding sessions with an AI pair programmer feel less like waiting for a remote server and more like interacting with a responsive local tool. On phones and tablets, the gains are even more visible because every millisecond saved is one less hit to the battery and one more reason to keep tasks on-device instead of round-tripping to the cloud.

There is also a broader trend here: speculative decoding is becoming a standard ingredient in modern LLM serving stacks, not a quirky research trick. We’ve already seen community-driven efforts like Medusa and EAGLE pushing parallel token generation, and Google’s MTP implementation for Gemma 4 is another sign that “single-token, strictly autoregressive decoding” is slowly evolving into “smart, parallel-verified drafting” for any serious, latency-sensitive workload.

All told, Google’s MTP drafters for Gemma 4 are less about a flashy new model and more about making existing intelligence feel usable in real time. With open weights, broad framework support, and tangible speed gains on everything from workstation GPUs to mobile devices, this is the kind of under-the-hood upgrade that developers notice first — and end users quietly benefit from every time their AI stops feeling like it’s thinking in slow motion.

