When Google first unveiled its Gemma 3 family of AI models earlier this year, it hinted at a future where powerful language understanding wouldn’t be confined to data centers. On Thursday, the company delivered on that promise by releasing Gemma 3n, an open-weight model optimized for on-device use and capable of running in as little as 2GB of RAM. That means developers can deploy a sophisticated large language model (LLM) directly on smartphones and other low-resource hardware, making AI more accessible, private, and responsive than ever before.
In May, Google teased the “Nano” aspirations of Gemma 3n at its I/O developer conference, promising an AI small enough to fit in your pocket. The latest release confirms those ambitions: despite raw parameter counts of 5 billion (E2B) and 8 billion (E4B), Gemma 3n behaves like a 2 billion- or 4 billion-parameter model in terms of memory footprint, needing just 2GB and 3GB of RAM, respectively. By offloading less-critical weights (its per-layer embeddings) to storage managed by the CPU, Gemma 3n keeps its active memory lean, ensuring fast, local inference even on modest hardware.
At the heart of Gemma 3n lies Google’s Matryoshka Transformer, or MatFormer. Inspired by Russian nesting dolls, this architecture trains a larger model (E4B) alongside a nested smaller one (E2B), sharing weights where it counts and dynamically switching between the two during inference. The result is what Google calls a “mobile-first architecture,” where a single model package serves both high-performance and lightweight use cases without compromising quality.
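To make the nesting-doll idea concrete, here is a toy sketch (illustrative only, not Google’s implementation) of how one shared weight set can serve two model sizes: the smaller model’s feed-forward weights are simply a leading slice of the larger model’s, so inference can switch widths without loading a second network.

```python
import random

# Toy MatFormer-style nesting (illustrative, not Google's code): the smaller
# model's weights are a leading slice of the larger model's, so a single
# shared parameter set serves both a fast path and a full-quality path.

random.seed(0)

D_MODEL = 4    # hidden size (tiny, for illustration)
FF_LARGE = 8   # feed-forward width of the "E4B-like" path
FF_SMALL = 4   # nested width of the "E2B-like" path

# Shared weights: w_in[j][k] maps hidden dim j to feed-forward unit k.
w_in = [[random.uniform(-1, 1) for _ in range(FF_LARGE)] for _ in range(D_MODEL)]
w_out = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(FF_LARGE)]

def ffn(x, width):
    """Feed-forward block evaluated at a chosen nested width."""
    h = [max(0.0, sum(x[j] * w_in[j][k] for j in range(D_MODEL)))
         for k in range(width)]                        # ReLU on sliced projection
    return [sum(h[k] * w_out[k][i] for k in range(width))
            for i in range(D_MODEL)]

x = [1.0, -0.5, 0.25, 2.0]
y_fast = ffn(x, FF_SMALL)   # lightweight path, fewer multiplies
y_full = ffn(x, FF_LARGE)   # full path, same shared weights
```

The point of the sketch is that `y_fast` costs half the arithmetic of `y_full` while reusing the same tensors, which is why a single Gemma 3n package can cover both the E2B and E4B use cases.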
A critical innovation enabling Gemma 3n’s efficiency is Per-Layer Embeddings (PLE). Traditional transformers load all parameters into accelerator memory, but PLE keeps only the most essential ones in fast memory (VRAM), shuffling the rest in and out via the CPU. Coupled with activation quantization and key-value cache (KV cache) sharing, this approach slashes RAM requirements and speeds up response times: on mobile devices, Gemma 3n is roughly 1.5× faster than its predecessor, Gemma 3 4B, while delivering superior output quality.
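The mechanics can be sketched with a simple cache-style simulation (again illustrative, assuming a hypothetical six-layer model, not Google’s actual runtime): core weights stay resident in fast memory, while each layer’s embeddings are streamed in only for the moment that layer runs, so peak fast-memory use stays far below the total parameter count.

```python
# Illustrative sketch of the Per-Layer Embedding idea (not Google's code):
# core transformer weights stay resident in fast accelerator memory, while
# each layer's embedding table is fetched from slow (CPU-side) storage only
# while that layer runs, keeping peak fast-memory usage low.

NUM_LAYERS = 6  # hypothetical layer count, for illustration

slow_store = {f"ple_layer_{i}": f"<embeddings for layer {i}>"
              for i in range(NUM_LAYERS)}
fast_memory = {"core_weights": "<shared transformer weights>"}  # always resident

peak = len(fast_memory)  # track the most tensors ever resident at once

def run_layer(i):
    """Fetch layer i's per-layer embeddings, 'compute', then evict them."""
    global peak
    key = f"ple_layer_{i}"
    fast_memory[key] = slow_store[key]    # stream in from slow memory
    peak = max(peak, len(fast_memory))
    result = f"layer {i} output"          # stand-in for real computation
    del fast_memory[key]                  # evict before the next layer loads
    return result

outputs = [run_layer(i) for i in range(NUM_LAYERS)]
```

Even though seven tensors exist in total, at most two ever occupy fast memory at once, which is the same trade PLE makes: more CPU-side traffic in exchange for a much smaller resident footprint.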
Gemma 3n isn’t just a text model; it’s multimodal. It natively ingests images, audio, video, and text, though it currently generates text only. It’s also a global citizen, supporting 140 languages for text inputs and 35 languages when working with multimodal data, making it a versatile tool for developers worldwide. Best of all, Google has released both the model weights and a “cookbook” of recipe-style instructions for fine-tuning and deployment under the Gemma license, which permits both academic and commercial use, allowing researchers and product teams alike to innovate without legal headaches.
Not content to ship one-size-fits-all models, Google is also releasing MatFormer Lab, a toolkit that lets developers experiment with different nesting depths, parameter allocations, and quantization settings to craft custom-sized Gemma derivatives between the E2B and E4B endpoints. Whether you need the leanest possible variant for tightly constrained hardware or a beefier engine for a laptop, MatFormer Lab provides a playground to tune for specific latency, memory, and accuracy trade-offs.
You don’t need a PhD to start tinkering with Gemma 3n. The full model line is available on Hugging Face and via Google’s Kaggle listings. For a zero-install trial, head to Google AI Studio, where you can spin up inference jobs or even deploy directly to Cloud Run. Want to see it in action? Google provides sample notebooks, Docker images, and command-line scripts to help you integrate Gemma 3n into chatbots, virtual assistants, or edge-AI demos within minutes.
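As a rough sketch of what a local trial might look like with the Hugging Face transformers library: the model id and pipeline task below follow the conventions on Gemma model cards but should be treated as assumptions and verified against the card for your chosen variant. The actual pipeline call is commented out so the sketch runs without downloading weights.

```python
# Hedged sketch of wiring Gemma 3n into a local app via Hugging Face
# transformers. MODEL_ID and the pipeline task are assumptions based on
# Gemma model-card conventions; confirm them on the model card itself.

MODEL_ID = "google/gemma-3n-E2B-it"  # assumed id for the 2B-effective instruct variant

def build_messages(user_text, image_url=None):
    """Build a chat-template message list; images ride along as content parts."""
    content = []
    if image_url:
        content.append({"type": "image", "url": image_url})
    content.append({"type": "text", "text": user_text})
    return [{"role": "user", "content": content}]

messages = build_messages("Describe this photo in one sentence.",
                          image_url="https://example.com/cat.jpg")

# Uncomment to run for real (requires `pip install -U transformers` and a
# first-run weight download):
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model=MODEL_ID)
# print(pipe(text=messages, max_new_tokens=64))
```

Because Gemma 3n accepts mixed content parts, the same message structure extends naturally from text-only chat to the image and audio demos the sample notebooks cover.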