GadgetBond


Gemma 4 just got faster with new MTP draft models

Google is rolling out Multi-Token Prediction drafters for Gemma 4, pushing open models to respond up to three times faster without hurting output quality.

By Shubham Sawarkar, Editor-in-Chief
May 6, 2026, 9:00 AM EDT
Dark blue promotional graphic for Gemma 4 featuring glowing blue particle streaks radiating toward the center, with the text “Gemma 4” and subtitle “Multi-Token Prediction Drafters” displayed prominently.
Image: Google

Google is turning the speed dial way up on Gemma 4. The company has released a new set of Multi-Token Prediction (MTP) “drafters” that sit alongside the main Gemma 4 models and can make them up to three times faster at inference, without sacrificing the quality of the answers you get back.

If you think about how most large language models work today, they’re basically very smart but slightly plodding typists. They generate one token at a time, in strict order, constantly shuttling billions of parameters in and out of GPU memory just to decide the next word. That “one-token-at-a-time” loop is surprisingly constrained not by raw compute, but by memory bandwidth, which is why even powerful GPUs can feel weirdly underused while your model is generating at a modest pace. Google’s MTP drafters are designed to attack that bottleneck head-on.
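To put a rough number on that bottleneck, here is a back-of-the-envelope sketch. It assumes every generated token requires reading all of the model's weights from memory once, so single-stream speed tops out near memory bandwidth divided by model size. The 2 bytes per parameter (16-bit weights) and the 1,000 GB/s bandwidth figure are illustrative assumptions, not numbers from Google, and the estimate ignores the KV cache, batching and MoE sparsity.

```python
def bandwidth_bound_tokens_per_second(params_billions, bytes_per_param, bandwidth_gb_per_s):
    """Rough ceiling on decode speed when all weights are re-read for each token."""
    # Billions of parameters times bytes per parameter gives the size in GB.
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_per_s / model_size_gb


# Hypothetical setup: a 31B dense model in 16-bit weights on a 1,000 GB/s device.
estimate = bandwidth_bound_tokens_per_second(31, 2, 1000)
print(f"~{estimate:.1f} tokens/s ceiling")
```

Under those assumptions the ceiling is roughly 16 tokens per second no matter how much arithmetic the chip could do in the same time, which is the slack a drafter tries to use.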

The basic idea is speculative decoding, and it’s simpler than it sounds once you strip away the jargon. You pair a big, slow-but-smart “target” model like Gemma 4 31B with a smaller, faster “drafter” that tries to guess several future tokens in one go. The drafter runs ahead and proposes a short sequence of likely continuations, which the big model then checks in a single forward pass. If the target model agrees with the whole chunk, it accepts it and even adds one more token of its own, so your app effectively gets multiple tokens in the time it used to take to generate just one. If it disagrees partway through, it keeps the tokens up to that point and substitutes its own choice for the first miss.
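The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal illustration of greedy speculative decoding in general, not Google's MTP implementation: `toy_target` and `toy_drafter` are hypothetical functions that map a token list to the next token, and the verification loop stands in for what a real target model would do in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The cheap drafter proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))
    # 2. The target picks its own token at every draft position (plus one extra).
    #    A real model scores all positions in a single forward pass.
    target_choices = [target(context + draft[:i]) for i in range(k + 1)]
    # 3. Keep the longest prefix of the draft that the target agrees with.
    accepted: List[int] = []
    for i in range(k):
        if draft[i] != target_choices[i]:
            break
        accepted.append(draft[i])
    # 4. Add the target's own token: a correction at the first miss,
    #    or a bonus token if the whole draft was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 3) -> Tuple[List[int], int]:
    """Generate tokens and count how many verification rounds were needed."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[int]) -> int:
    # Hypothetical "big" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[int]) -> int:
    # Hypothetical "small" model: usually right, but always wrong after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_drafter, prompt=[0], max_new_tokens=12)
print(tokens, rounds)
```

Because the target has the final say at every position, the output is identical to what the target would produce on its own with greedy decoding; only the number of target rounds shrinks. Production systems that sample rather than decode greedily use a probabilistic accept/reject rule instead of the exact-match check shown here.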

Bar chart titled “Gemma 4 MTP Drafter Speed-ups” comparing token-per-second performance gains across Gemma 4 models and hardware platforms, including Samsung S26 mobile GPU, Pixel TPU, Apple M4, and NVIDIA A100, with speed improvements ranging from up to 1.5x to 3.1x.
Image: Google

This is where the claimed “up to 3x speedup” comes from. By offloading the easy parts of the job to a much smaller helper, the system uses idle compute that would otherwise just be waiting on memory, especially on consumer GPUs and laptops. Google says the MTP drafters are tuned so that you don’t trade speed for sloppiness: the larger Gemma 4 model still has veto power over every token, which means reasoning quality and accuracy stay effectively the same while latency drops.
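As a rough way to see how the gain depends on how often the drafter guesses right, the sketch below computes the expected number of tokens produced per target pass. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the 0.8 acceptance rate and draft length of 4 are illustrative values, not figures from Google.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; the target always contributes one token of its own.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


# Illustrative values only: 80% acceptance, 4 drafted tokens per round.
print(round(expected_tokens_per_pass(0.8, 4), 2))
```

This counts tokens per target pass rather than wall-clock speedup, since the drafter's own run time is not free; real gains land somewhat below this figure.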

Importantly, this is not just a lab demo. The MTP drafters cover the full Gemma 4 lineup: the 31B dense flagship, the 26B Mixture-of-Experts variant, and the smaller E2B and E4B models aimed at phones, tablets and other edge devices. That means the same speculative decoding trick works for everything from cloud-based coding agents to on-device chatbots and voice assistants that need to feel snappy on mobile hardware.


Under the hood, Google has done some clever engineering, so these drafters aren’t just bolted on as an afterthought. The drafter models are small (think a handful of layers rather than a giant stack), and they reuse the target model’s internal activations and KV cache. That means they don’t waste time recomputing context the main model has already worked out, which is a big part of why the speedups are real and not just marketing. For the edge-focused E2B and E4B variants, Google has even added efficient clustering in the embedder step to speed up the final logits calculation, a step that becomes a major bottleneck on smaller devices.

If you’re a developer, the pitch is straightforward: faster responses, same brains. Less latency directly translates into smoother chat experiences, more natural-feeling voice interfaces, and more capable multi-step agents that can plan and react without constant awkward pauses. On the local side, Google specifically calls out running the 26B MoE and 31B dense models on personal machines and consumer GPUs with what they describe as “unprecedented” speed, so offline coding assistants and local agent workflows become much more realistic.

There’s also a throughput angle, not just latency. On GPUs like NVIDIA’s A100 or on Apple Silicon, processing multiple requests together with larger batch sizes lets speculative decoding really shine, with Google observing speedups of around 2.2x in some local Apple Silicon scenarios when you bump the batch from 1 to the 4–8 range. That matters for anyone running shared backends or multi-user tools, where keeping the GPU busier without increasing costs per request is a big win.

From an ecosystem point of view, Google is clearly trying to make this as accessible as possible. The MTP drafters ship under the same Apache 2.0 open-source license as Gemma 4 itself, and the weights are already live on major hubs like Hugging Face and Kaggle as “assistant” models that correspond to each main checkpoint. On the software side, support is rolling out across the usual suspects: Transformers, vLLM, SGLang, MLX for Apple Silicon, LiteRT-LM for edge deployments, plus Ollama for developers who prefer a more turnkey local setup.

In practical terms, using MTP with Gemma 4 in popular frameworks is meant to feel more like flipping a switch than rewriting your stack. In Hugging Face Transformers, for example, you load the main Gemma 4 model as the target, then load the matching “-assistant” drafter and pass it as the assistant_model argument when you call generate. vLLM exposes it via a speculative decoding config flag, and some on-device LiteRT exports bake the MTP structures directly into the graph so that edge runtimes can exploit them with minimal custom code.
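For the Transformers path, the call pattern might look something like the sketch below. The `assistant_model` argument to `generate` is the existing assisted-generation hook in Transformers; the helper name `generate_with_drafter` is made up for illustration, and the checkpoint IDs in the comments are placeholders, since the exact Gemma 4 repository names are not given here.

```python
def generate_with_drafter(target_model, drafter_model, tokenizer, prompt, max_new_tokens=128):
    """Generate text from the target model, using the drafter for assisted decoding."""
    # Tokenize the prompt; on a GPU you would also move the inputs to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Passing assistant_model switches generate() to draft-and-verify decoding.
    output_ids = target_model.generate(
        **inputs,
        assistant_model=drafter_model,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Typical setup (placeholder checkpoint IDs, substitute the real ones):
#
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("<gemma-4-target-checkpoint>")
#   target = AutoModelForCausalLM.from_pretrained("<gemma-4-target-checkpoint>")
#   drafter = AutoModelForCausalLM.from_pretrained("<gemma-4-target-checkpoint>-assistant")
#   print(generate_with_drafter(target, drafter, tokenizer, "Explain speculative decoding."))
```

Keeping the loading step separate from the generation helper means the same function works whether the drafter is loaded from a hub or from a local directory.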

The timing of this release is also interesting when you zoom out a bit. Gemma 4 has only been around for a few weeks and has already crossed roughly 60 million downloads, which is a huge number for an open-weight model family that isn’t branded as the “main” Google assistant product. Shipping speculative decoding drafters this quickly suggests Google sees Gemma not just as a research artifact, but as a serious platform it wants people to build real products on — including those that need to run on personal hardware, not just in hyperscale data centers.

From a user’s perspective, you’re not going to see “MTP drafter engaged” on screen, but you will feel it in how the model behaves. Messages start streaming faster, long-form answers land sooner, and coding sessions with an AI pair programmer feel less like waiting for a remote server and more like interacting with a responsive local tool. On phones and tablets, the gains are even more visible because every millisecond saved is one less hit to the battery and one more reason to keep tasks on-device instead of round-tripping to the cloud.

There is also a broader trend here: speculative decoding is becoming a standard ingredient in modern LLM serving stacks, not a quirky research trick. We’ve already seen community-driven efforts like Medusa and EAGLE pushing parallel token generation, and Google’s MTP implementation for Gemma 4 is another sign that “single-token, strictly autoregressive decoding” is slowly evolving into “smart, parallel-verified drafting” for any serious, latency-sensitive workload.

All told, Google’s MTP drafters for Gemma 4 are less about a flashy new model and more about making existing intelligence feel usable in real time. With open weights, broad framework support, and tangible speed gains on everything from workstation GPUs to mobile devices, this is the kind of under-the-hood upgrade that developers notice first — and end users quietly benefit from every time their AI stops feeling like it’s thinking in slow motion.

