Gemma 4 is Google’s new open AI model that’s quietly redefining what “on-device intelligence” can look like on Android phones and in Android Studio — and most of that magic happens locally, not in the cloud. It’s the foundation for the next-gen Gemini Nano 4, which means the same model that helps you code on your laptop can also power smart features right on your phone, with big gains in speed and battery life.
At a high level, Gemma 4 is a family of open models purpose-built for advanced reasoning and agent-style workflows rather than just simple chat replies. Google is shipping it in four sizes — Effective 2B (E2B), Effective 4B (E4B), a 26B Mixture-of-Experts variant, and a 31B dense model — all under the Apache 2.0 license, which makes it very attractive for serious commercial apps. Despite those modest parameter counts, the bigger variants punch above their weight: the 31B model currently ranks among the top open models on the Arena AI leaderboard, beating systems up to 20x larger in reasoning benchmarks. The smaller E2B and E4B versions are tuned for edge devices with long context windows (128K tokens) and multimodal support, so they can handle long chats, documents, and images without constantly calling the cloud.
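To make the size/device trade-off concrete, here is a minimal sketch of how an app might pick a variant by available memory. The function name and RAM thresholds are illustrative assumptions, not official Google sizing guidance:

```kotlin
// Hypothetical helper: pick a Gemma 4 variant from the device's available RAM.
// Thresholds below are illustrative guesses, not published requirements.
fun pickGemmaVariant(availableRamMb: Int): String = when {
    availableRamMb >= 24_000 -> "31B"     // dense flagship; workstation-class hardware
    availableRamMb >= 16_000 -> "26B-MoE" // Mixture-of-Experts; high-end desktops
    availableRamMb >= 6_000  -> "E4B"     // flagship phones and tablets
    else                     -> "E2B"     // entry-level or battery-sensitive devices
}
```

The point of the edge-tuned E2B/E4B split is exactly this kind of graceful degradation: the same 128K-context, multimodal feature set at two memory footprints.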
On Android, Gemma 4 shows up in two main places: Android Studio and the device itself. In Android Studio, you can select Gemma 4 as your local model and get full Agent Mode support, meaning the AI can not only autocomplete or suggest snippets but also plan and execute multi-step edits across your codebase. Think of commands like “build a basic notes app with Compose,” “migrate all hardcoded strings to strings.xml,” or “add dark mode support,” and the agent will scan multiple files, propose changes, and apply them — all while your code never leaves your machine. Because inference runs locally on your GPU and RAM, you get fast responses, predictable performance, and no quota anxiety from cloud APIs.
The second pillar is on-device intelligence through AICore and ML Kit’s GenAI Prompt API. Gemma 4 is now the base model for Gemini Nano 4, which powers OS-level smart features and lets developers build their own AI-powered experiences that run directly on the phone. Google says the new Nano 4, built on Gemma 4, is up to four times faster than previous Nano generations and can cut battery usage by as much as 60 percent for AI workloads, which is a huge deal for anything that needs frequent, low-latency inference like live transcription, summarization, or smart camera features. With the AICore Developer Preview, devs can already prototype against Gemma 4’s E2B and E4B variants; that same code is expected to work on mass-market Gemini Nano 4 devices arriving later this year.
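The on-device flow described above boils down to two steps: check whether the model is present (AICore manages download and updates), then run inference locally. The sketch below models that shape in plain Kotlin; the class and method names (`LocalGenAi`, `checkStatus`, `prompt`) are hypothetical stand-ins, not the actual ML Kit GenAI Prompt API surface:

```kotlin
// Illustrative sketch of an AICore-style local inference flow.
// All names here are hypothetical; consult the real ML Kit GenAI docs.
class LocalGenAi(private val modelReady: Boolean) {
    enum class Status { READY, DOWNLOAD_REQUIRED }

    // In a real app this would query AICore for the on-device model state.
    fun checkStatus(): Status =
        if (modelReady) Status.READY else Status.DOWNLOAD_REQUIRED

    // Runs a prompt only when the model is available on-device.
    fun prompt(text: String): String {
        require(checkStatus() == Status.READY) { "Model not downloaded yet" }
        // Real inference would execute inside AICore; we echo for illustration.
        return "summary(${text.take(20)})"
    }
}
```

The availability check matters in practice: on devices where Gemini Nano 4 hasn't shipped or the model hasn't downloaded yet, the app should prompt a download or fall back rather than crash.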
Crucially, Gemma 4 isn’t just about text — it’s multimodal out of the box, taking in text and images and, in Google’s broader positioning, tying into audio scenarios as well. That opens up very Android-specific ideas: a local assistant that understands what’s on your screen, camera-based troubleshooting in poor connectivity, or photo inbox triage where privacy really matters. Because all of this is anchored in an open, Apache-licensed family, Android OEMs and app developers can experiment aggressively: custom-tuned assistants for their devices, vertical-specific copilots that never send data off-device, or hybrid setups that fall back to cloud models only when absolutely necessary.
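A hybrid setup like the one mentioned above is ultimately a routing policy. Here is one minimal, hypothetical version: sensitive data always stays local, requests that fit the 128K on-device window stay local, and only oversized requests with connectivity go to the cloud. The type and field names are illustrative assumptions:

```kotlin
// Hypothetical hybrid routing policy: prefer on-device inference,
// fall back to the cloud only when strictly necessary.
data class InferenceRequest(
    val tokens: Int,        // estimated prompt + context size
    val sensitive: Boolean, // user data that must not leave the device
    val online: Boolean     // current connectivity
)

fun routeRequest(req: InferenceRequest, localContextLimit: Int = 128_000): String = when {
    req.sensitive -> "local"                    // private data never leaves the device
    req.tokens <= localContextLimit -> "local"  // fits the on-device context window
    req.online -> "cloud"                       // overflow only when connected
    else -> "local-truncated"                   // degrade gracefully while offline
}
```

Putting privacy first in the `when` branches encodes the promise these setups make: the cloud is an overflow valve, not the default.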
From a privacy and cost angle, Gemma 4 is Google’s clearest statement yet that serious AI doesn’t have to live exclusively in the cloud. Local inference means your code, documents, and on-device data can stay where they are, which is key for regulated industries or security-conscious teams that still want modern AI tooling. It also means AI features become less of a per-request cost decision and more of an upfront hardware decision: if your dev box and target devices are powerful enough, you get rich agentic behavior “for free” at runtime. Google is even planning to add Gemma 4 to Android Bench so developers can see hard numbers on latency and performance across devices before committing to a specific model size.