Google has quietly dropped one of its most interesting Gemini updates yet — a model that doesn’t just read your text, but also sees your images, watches your videos, listens to your audio, and parses your PDFs, then throws everything into the same mathematical space. It’s called Gemini Embedding 2, it’s in public preview through the Gemini API and Vertex AI, and if you care about RAG, semantic search, or any kind of “find the right thing in a massive pile of content” problem, this is a big deal.
At a high level, embeddings are just vectors — long lists of numbers that represent the “meaning” of something. Traditionally, you had separate models for different modalities: one for text search, another like CLIP for images, maybe a custom thing for audio. Gemini Embedding 2 collapses all that complexity and says: send me text, images, video, audio, or PDFs and I’ll map them into one unified embedding space where everything can be compared directly. That means you can do things like: search a video archive using a sentence, find images using an audio clip, or use a screenshot as a query to retrieve documents.
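Because everything lands in one space, cross-modal comparison is just vector math, typically cosine similarity. A minimal sketch of that idea; the vectors below are random stand-ins for real model output (in practice one would come from embedding a caption, one an image, one an audio clip):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings of different modalities; a real system would
# get all of these from the embedding model, in the same shared space.
rng = np.random.default_rng(0)
text_vec = rng.normal(size=3072)
image_vec = text_vec + rng.normal(scale=0.5, size=3072)  # "related" image
audio_vec = rng.normal(size=3072)                        # unrelated audio

print(cosine_similarity(text_vec, image_vec))  # high: related content
print(cosine_similarity(text_vec, audio_vec))  # near zero: unrelated
```

The point is that once the model has done its job, "find images matching this sentence" and "find documents matching this screenshot" are the same nearest-neighbor query.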
Under the hood, this model leans heavily on the multimodal understanding baked into the Gemini architecture. It supports five main input types: text (up to 8,192 tokens), up to six images per request (PNG, JPEG), up to 120 seconds of video (MP4, MOV), raw audio without forcing you through speech-to-text, and PDFs up to six pages. Crucially, it’s not limited to one modality at a time — you can interleave text and images, or mix video frames with audio, and the model jointly embeds the whole thing, capturing relationships such as “this caption refers to that part of the image” or “this spoken line relates to that on-screen action.”
On the output side, Google is sticking with high-dimensional vectors but making them more flexible. By default, Gemini Embedding 2 emits 3,072-dimensional embeddings, but it uses Matryoshka Representation Learning (MRL) to “nest” information so you can safely truncate down to 1,536 or 768 dimensions while retaining a lot of the semantic power. This matters in practice because vector databases can get expensive: smaller vectors mean cheaper storage, faster indexing, and more responsive search, while still letting you switch to full 3,072-dimension embeddings when you need maximum accuracy, for example, in a reranking stage. Providers like Qdrant are already describing two-pass retrieval setups where you scan with lower-dimension vectors, then rescore top candidates with the full-size ones.
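With Matryoshka-style embeddings, "truncating" really is just keeping the leading components and re-normalizing, since MRL training packs the coarsest semantics into the front of the vector. A sketch of the mechanics and the storage savings, using a dummy vector at the dimensions quoted above (no API calls here):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.
    With Matryoshka-style training, the leading components carry the most
    important semantics, so this is a reasonable approximation."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 768)
print(small.shape)                  # (768,)
print(full.nbytes // small.nbytes)  # 4x less storage per vector
```

At float32 or float64, a 768-dimension index is a quarter the size of a 3,072-dimension one, which is where the cost and latency wins come from.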
Google is framing this as a state-of-the-art step up from its earlier, mostly text-focused embedding models — especially for multilingual and multimodal tasks. The original Gemini Embedding work already showed strong results on MTEB benchmarks across classification, clustering, and retrieval, outscoring other popular open and commercial models on both English and multilingual leaderboards. Gemini Embedding 2 builds on that but extends it into speech, images, and video, and Google says it now outperforms leading alternatives across text, image, and video tasks while adding robust speech understanding. For developers, the punchline is that you no longer need a patchwork of different models, glue logic, and ad-hoc scoring tricks to get decent multimodal retrieval.
The real shift is what this unlocks for everyday AI workflows. Retrieval-Augmented Generation (RAG) becomes more than “search a PDF store by text and stuff snippets into a prompt”; you can now build systems that retrieve across email-style text, design mocks, product photos, marketing videos, and recorded calls, all inside the same pipeline. Semantic search stops being text-only and starts to look more like “find me anything in this organization that matches this idea, regardless of format.” Classic tasks like sentiment analysis and clustering also benefit because the model isn’t blind to non-textual signals: it can, for instance, cluster video clips by theme or emotion, or group customer tickets that include screenshots and logs, not just words.
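The clustering point is worth making concrete: once every ticket, screenshot, or clip is a vector in the same space, a generic algorithm like k-means groups them without caring about format. A self-contained sketch with synthetic vectors standing in for mixed-media embeddings:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-ins for embeddings of mixed-media items: two latent
# "themes", each with three items of (pretend) different modalities.
theme_a, theme_b = rng.normal(size=64), rng.normal(size=64)
items = np.stack(
    [theme_a + rng.normal(scale=0.2, size=64) for _ in range(3)]
    + [theme_b + rng.normal(scale=0.2, size=64) for _ in range(3)])

def kmeans(X: np.ndarray, k: int = 2, iters: int = 20) -> np.ndarray:
    """Plain k-means: operates on vectors, whatever they embed."""
    # Naive init (first and middle item); real k-means uses k-means++.
    centroids = X[[0, len(X) // 2]]
    for _ in range(iters):
        # Assign each item to its nearest centroid.
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Move each centroid to the mean of its members.
        centroids = np.stack([X[labels == c].mean(axis=0) for c in range(k)])
    return labels

labels = kmeans(items)
print(labels)  # first three items share one label, last three the other
```

Nothing in `kmeans` knows or cares which items started life as video, audio, or text, which is exactly the property a unified embedding space buys you.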
Google is already pointing to early adopters to make this less abstract. Media companies, for example, are using Gemini Embedding 2 to search huge archives of B-roll, editorial footage, and untranscribed content. One quoted partner reports that with the new embeddings, simple text queries can now retrieve very specific video shots — including subtle, previously untranscribed micro-expressions — and even allow using an image or a random B-roll clip as the query to discover similar footage. In one internal test, this took their text-to-video Recall@1 to 85.3%, a big jump in how often the top result was exactly the right clip. For anyone who has ever tried to dig through a chaotic media drive, that kind of “it just finds the exact shot I had in mind” experience is a huge productivity boost.
On the platform side, Google is making sure this model is accessible in all the usual modern AI developer workflows. You can hit it directly via the Gemini API or through Vertex AI, which exposes it as gemini-embedding-2-preview and documents how to send different media and tune parameters like task_type (for retrieval, code search, or semantic similarity) and output_dimensionality. If you live in the vector database ecosystem, you get ready-made integrations with LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vertex Vector Search, so plugging it into an existing stack is mostly a matter of swapping out the embedding backend and reindexing. There are also Colab notebooks from Google that walk through setting up semantic search and multimodal RAG pipelines end-to-end.
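"Swapping out the embedding backend and reindexing" usually comes down to one seam: a function from content to vector, plus a one-time re-embed of the corpus. A toy sketch of that seam; the hash-based stub is an offline stand-in, and in a real stack you would replace it with a call to your old or new embedding API:

```python
import hashlib
import math
from typing import Callable

# An embedding backend is just "content in, unit vector out". In a real
# stack this would call the embedding model; a deterministic hash-based
# stub stands in here so the sketch runs offline.
EmbedFn = Callable[[str], list]

def stub_embed(content: str, dims: int = 8) -> list:
    """Deterministic fake embedding: hash the content into `dims` floats."""
    digest = hashlib.sha256(content.encode()).digest()
    vec = [b / 255.0 for b in digest[:dims]]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class InMemoryIndex:
    def __init__(self, embed: EmbedFn):
        self.embed = embed
        self.items = []  # list of (document, vector) pairs

    def add(self, doc: str) -> None:
        self.items.append((doc, self.embed(doc)))

    def reindex(self, new_embed: EmbedFn) -> None:
        """Swapping backends means re-embedding everything once."""
        self.embed = new_embed
        self.items = [(doc, new_embed(doc)) for doc, _ in self.items]

    def search(self, query: str) -> str:
        q = self.embed(query)
        return max(self.items,
                   key=lambda it: sum(a * b for a, b in zip(q, it[1])))[0]

index = InMemoryIndex(stub_embed)
for doc in ["reset your router", "update billing info", "export a report"]:
    index.add(doc)
print(index.search("reset your router"))

# Switching to a new backend: pass the new embed function and reindex.
index.reindex(lambda s: stub_embed(s, dims=16))
print(index.search("update billing info"))
```

Real vector databases keep the same shape: the store is backend-agnostic, and the migration cost is the one-off reindex pass.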
For all the engineering talk, the practical examples are where this gets interesting. Imagine a support system where a user uploads a shaky phone video of a device problem; the system embeds the audio, video, and any overlaid text together, then retrieves the most relevant troubleshooting guides, past tickets, and internal docs in one shot. Or an internal search tool where a designer can drag in a prototype screenshot and instantly find the latest Figma specs, design tokens, and even related product requirement docs, without hand-curated tags. In enterprise settings, this could stretch to compliance (searching across recorded calls, scanned documents, and dashboards) or creative workflows (matching music and visuals, curating highlight reels from huge video dumps) — essentially anywhere you have a messy mix of formats and a vague question in natural language.
There are also clear cost and operations angles here. Because of MRL and adjustable dimensionality, teams can start with lower-dimension embeddings for broad recall, then selectively bump up to full-size vectors in smaller, high-precision stages. This pattern plays nicely with multi-stage retrieval architectures that are becoming standard in large-scale RAG and search systems, where you blend fast approximate search with slower, more accurate reranking. And, since the model is multilingual across more than 100 languages, global products can unify their search infrastructure instead of managing separate models and indexes per language.
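That multi-stage pattern is easy to sketch end to end: scan the whole corpus with 768-dimension prefixes, keep a shortlist, then rescore it at the full 3,072 dimensions. Synthetic data below; a real system would swap in stored model embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
N, FULL_D, SMALL_D, SHORTLIST = 2_000, 3072, 768, 20

# Corpus embeddings at full size (unit-normalized), plus truncated,
# re-normalized MRL prefixes used for the cheap first pass.
corpus_full = rng.normal(size=(N, FULL_D))
corpus_full /= np.linalg.norm(corpus_full, axis=1, keepdims=True)
corpus_small = corpus_full[:, :SMALL_D].copy()
corpus_small /= np.linalg.norm(corpus_small, axis=1, keepdims=True)

# A query deliberately close to document 1234, so there is a right answer.
target = 1234
query_full = corpus_full[target] + rng.normal(scale=0.05, size=FULL_D)
query_full /= np.linalg.norm(query_full)
query_small = query_full[:SMALL_D] / np.linalg.norm(query_full[:SMALL_D])

# Pass 1: cheap scan of all N documents at 768 dims.
shortlist = np.argsort(corpus_small @ query_small)[-SHORTLIST:]

# Pass 2: exact rescoring of the shortlist at the full 3,072 dims.
best = shortlist[np.argmax(corpus_full[shortlist] @ query_full)]
print(best)  # recovers the true match
```

The first pass touches every vector but at a quarter of the dimensionality; the second pass pays full price only for the handful of candidates that survive.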
The launch timing also fits Google’s broader strategy: keep Gemini at the center, but ship specialized pieces that solve very real, very unsexy infrastructure problems for developers — retrieval quality, multimodal search, latency, and cost. LLMs get the headlines, but embeddings quietly decide whether your chatbot actually finds the right context or your “AI search” feels magical versus mediocre. With Gemini Embedding 2, Google is betting that a single, natively multimodal embedding backbone will become the default for these retrieval-heavy AI apps, especially as more companies move from text-only workflows to truly mixed media.