GadgetBond


Gemma 4 just got faster with new MTP draft models

Google is rolling out Multi-Token Prediction drafters for Gemma 4, pushing open models to respond up to three times faster without hurting output quality.

By Shubham Sawarkar, Editor-in-Chief
May 6, 2026, 9:00 AM EDT
Image: Google’s promotional graphic for Gemma 4 “Multi-Token Prediction Drafters.”

Google is turning the speed dial way up on Gemma 4. The company has released a new set of Multi-Token Prediction (MTP) “drafters” that sit alongside the main Gemma 4 models and can make them up to three times faster at inference, without sacrificing the quality of the answers you get back.

If you think about how most large language models work today, they’re basically very smart but slightly plodding typists. They generate one token at a time, in strict order, constantly shuttling billions of parameters in and out of GPU memory just to decide the next word. That one-token-at-a-time loop is constrained not by raw compute but by memory bandwidth, which is why even powerful GPUs can feel weirdly underused while your model generates at a modest pace. Google’s MTP drafters are designed to attack that bottleneck head-on.

The basic idea is speculative decoding, and it’s simpler than it sounds once you strip away the jargon. You pair a big, slow-but-smart “target” model like Gemma 4 31B with a smaller, faster “drafter” that tries to guess several future tokens in one go. While the main model is still doing its heavy computation, the drafter runs ahead and proposes a sequence of likely continuations, which the big model then checks in a single forward pass. If the target model agrees, it accepts the whole chunk — and even adds one more token of its own — so your app effectively gets multiple tokens in the time it used to take to generate just one.
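The draft-then-verify loop described above can be sketched in a few lines. This is a toy greedy variant, not Google’s implementation: the “target” and “drafter” here are stand-in lookup functions over a fixed string, and real systems compare probability distributions rather than exact tokens. What it does show is the key property that the output matches what the target alone would have produced, while needing far fewer target passes.

```python
# Toy sketch of speculative decoding (greedy variant). The target and
# drafter below are stand-ins, not real models.
TEXT = "the quick brown fox jumps over the lazy dog"

def target_next(seq):
    """The big, slow 'target' model: always right (here, a lookup)."""
    return TEXT[len(seq)] if len(seq) < len(TEXT) else "."

def draft_next(seq):
    """The small, fast 'drafter': right most of the time, but it
    deterministically gets the letter 'o' wrong."""
    t = target_next(seq)
    return "0" if t == "o" else t

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    start, target_passes = len(seq), 0
    while len(seq) - start < n_tokens:
        # 1. The drafter runs ahead and proposes k future tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One target "forward pass" verifies the whole draft at once.
        target_passes += 1
        accepted = 0
        for i in range(k):
            if target_next(seq + draft[:i]) == draft[i]:
                accepted += 1   # target agrees: keep the drafted token
            else:
                break           # first disagreement ends the draft
        seq.extend(draft[:accepted])
        # 3. The target always contributes one more token: its correction
        #    at the first mismatch, or a bonus token after a full accept.
        seq.append(target_next(seq))
    return "".join(seq[start:]), target_passes

out, passes = speculative_decode("", n_tokens=20)
print(out, "| target passes:", passes)
```

Even with the drafter reliably wrong about one letter, the output is identical to the target’s own greedy output, which is exactly the “veto power” guarantee the article describes.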

Image: Google’s chart of Gemma 4 MTP drafter speed-ups across models and hardware (Samsung S26 mobile GPU, Pixel TPU, Apple M4, NVIDIA A100), with gains ranging from roughly 1.5x to 3.1x.

This is where the claimed “up to 3x speedup” comes from. By offloading the easy parts of the job to a much smaller helper, the system uses idle compute that would otherwise just be waiting on memory, especially on consumer GPUs and laptops. Google says the MTP drafters are tuned so that you don’t trade speed for sloppiness: the larger Gemma 4 model still has veto power over every token, which means reasoning quality and accuracy stay effectively the same while latency drops.
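The arithmetic behind an “up to 3x” figure is worth a quick sketch. A standard result from the speculative-decoding literature is that if each of k drafted tokens is independently accepted with probability alpha, the expected number of tokens emitted per target forward pass (counting the target’s correction or bonus token) is (1 − alpha^(k+1)) / (1 − alpha). The alpha and k values below are illustrative, not Google’s published numbers, and the calculation ignores the drafter’s own small cost:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per target forward pass when each of k
    drafted tokens is independently accepted with probability alpha,
    plus the target's own correction/bonus token:
    (1 - alpha**(k + 1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Illustrative acceptance rates only: the speedup ceiling is roughly
# this number when the drafter is nearly free to run.
for alpha, k in [(0.6, 4), (0.7, 4), (0.8, 5)]:
    print(f"alpha={alpha}, k={k}: ~{expected_tokens_per_pass(alpha, k):.2f}x")
```

In other words, a drafter that guesses right about 80% of the time with a five-token draft is enough to put a 3x-class speedup within reach, which is why drafter quality matters so much.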

Importantly, this is not just a lab demo. The MTP drafters cover the full Gemma 4 lineup: the 31B dense flagship, the 26B Mixture-of-Experts variant, and the smaller E2B and E4B models aimed at phones, tablets and other edge devices. That means the same speculative decoding trick works for everything from cloud-based coding agents to on-device chatbots and voice assistants that need to feel snappy on mobile hardware.


Under the hood, Google has done some clever engineering so these drafters aren’t just bolted on as an afterthought. The drafter models are small — think a handful of layers rather than a giant stack — and they reuse the target model’s internal activations and KV cache. That means they don’t waste time recomputing context the main model has already worked out, which is a big part of why the speedups are real and not just marketing. For the edge-focused E2B and E4B variants, Google has even added efficient clustering in the embedder step to speed up the final logits calculation, which becomes a major bottleneck on smaller devices.

If you’re a developer, the pitch is straightforward: faster responses, same brains. Lower latency translates directly into smoother chat experiences, more natural-feeling voice interfaces, and more capable multi-step agents that can plan and react without constant awkward pauses. On the local side, Google specifically calls out running the 26B MoE and 31B dense models on personal machines and consumer GPUs with what it describes as “unprecedented” speed, so offline coding assistants and local agent workflows become much more realistic.

There’s also a throughput angle, not just latency. On GPUs like NVIDIA’s A100 or on Apple Silicon, processing multiple requests together with larger batch sizes lets speculative decoding really shine, with Google observing speedups of around 2.2x in some local Apple Silicon scenarios when you bump the batch from 1 to the 4–8 range. That matters for anyone running shared backends or multi-user tools, where keeping the GPU busier without increasing costs per request is a big win.

From an ecosystem point of view, Google is clearly trying to make this as accessible as possible. The MTP drafters ship under the same Apache 2.0 open-source license as Gemma 4 itself, and the weights are already live on major hubs like Hugging Face and Kaggle as “assistant” models that correspond to each main checkpoint. On the software side, support is rolling out across the usual suspects: Transformers, vLLM, SGLang, MLX for Apple Silicon, LiteRT-LM for edge deployments, plus Ollama for developers who prefer a more turnkey local setup.

In practical terms, using MTP with Gemma 4 in popular frameworks is meant to feel more like flipping a switch than rewriting your stack. In Hugging Face Transformers, for example, you load the main Gemma 4 model as the target, then load the matching “-assistant” drafter and pass it as an assistant_model argument when you call generate. vLLM exposes it via a speculative decoding config flag, and some on-device LiteRT exports bake the MTP structures directly into the graph so that edge runtimes can exploit them with minimal custom code.
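As a concrete sketch of the Transformers path: the library’s existing assisted-generation support takes a drafter via the `assistant_model` argument to `generate()`, so wiring one in looks roughly like this. The checkpoint IDs below are placeholders inferred from the article’s “-assistant” naming convention, not verified model hub names, so check the actual hub listings before running:

```python
# Sketch of assisted (speculative) generation in Hugging Face Transformers.
# The checkpoint names are hypothetical placeholders, not confirmed IDs.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4-31b"             # hypothetical main checkpoint
drafter_id = "google/gemma-4-31b-assistant"  # hypothetical MTP drafter

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(drafter_id, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(target.device)

# Passing the drafter as `assistant_model` switches generate() into
# assisted decoding: the drafter proposes tokens, the target verifies.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The vLLM route is similar in spirit: a speculative-decoding config option pointing at the drafter weights, with no changes to the request path.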

The timing of this release is also interesting when you zoom out a bit. Gemma 4 has only been around for a few weeks and has already crossed roughly 60 million downloads, which is a huge number for an open-weight model family that isn’t branded as the “main” Google assistant product. Shipping speculative decoding drafters this quickly suggests Google sees Gemma not just as a research artifact, but as a serious platform it wants people to build real products on — including those that need to run on personal hardware, not just in hyperscale data centers.

From a user’s perspective, you’re not going to see “MTP drafter engaged” on screen, but you will feel it in how the model behaves. Messages start streaming faster, long-form answers land sooner, and coding sessions with an AI pair programmer feel less like waiting for a remote server and more like interacting with a responsive local tool. On phones and tablets, the gains are even more visible because every millisecond saved is one less hit to the battery and one more reason to keep tasks on-device instead of round-tripping to the cloud.

There is also a broader trend here: speculative decoding is becoming a standard ingredient in modern LLM serving stacks, not a quirky research trick. We’ve already seen community-driven efforts like Medusa and Eagle pushing parallel token generation, and Google’s MTP implementation for Gemma 4 is another sign that “single-token, strictly autoregressive decoding” is slowly evolving into “smart, parallel-verified drafting” for any serious, latency-sensitive workload.

All told, Google’s MTP drafters for Gemma 4 are less about a flashy new model and more about making existing intelligence feel usable in real time. With open weights, broad framework support, and tangible speed gains on everything from workstation GPUs to mobile devices, this is the kind of under-the-hood upgrade that developers notice first — and end users quietly benefit from every time their AI stops feeling like it’s thinking in slow motion.



Topics: Gemini AI (formerly Bard), Google DeepMind