Google Releases Gemini Embedding 2, Its First Natively Multimodal Embedding Model

Tuesday, March 10, 2026

Details

  • Google launched Gemini Embedding 2 in public preview through the Gemini API and Vertex AI. The model maps text, images, video, audio, and documents into a single embedding space and supports more than 100 languages.
  • Built by Google on the Gemini architecture, it succeeds the company's earlier text-only embedding models. It integrates with frameworks such as LangChain and LlamaIndex and with vector databases such as Weaviate and Qdrant.
  • Accepts up to 8,192 text tokens, 6 images (PNG or JPEG), 120 seconds of video (MP4 or MOV), and PDFs of up to 6 pages, and processes audio natively. Inputs can interleave modalities, for example an image with accompanying text. The model uses Matryoshka Representation Learning, so output vectors can be shortened; the recommended sizes are 3072, 1536, and 768 dimensions.
  • Unlike its text-only predecessors, it supports native multimodal retrieval. It reportedly outperforms leading models on text, image, and video tasks and adds speech capabilities, targeting use cases such as RAG, semantic search, sentiment analysis, and clustering.
  • Vertex AI documentation confirms support for custom task instructions and 3072-dimensional vectors. No directly competing launch was verified in the preceding 90 days; OpenAI's text embeddings and Cohere's embedding models remain text-focused and do not unify audio and video natively.
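
The flexible dimensions mentioned above can be illustrated with a minimal sketch of the usual pattern for Matryoshka-trained embeddings: keep the leading components of the full vector, then rescale to unit length before comparing. This is not Google's client code. The helper names `truncate_embedding` and `cosine_similarity` are hypothetical, the vectors are random placeholders rather than real model output, and the sketch assumes only that a full 3072-dimensional vector is already in hand (the API may also be able to return shorter vectors directly).

```python
import math
import random

# Sizes listed as recommended in the announcement.
RECOMMENDED_DIMS = (3072, 1536, 768)


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim > len(vector):
        raise ValueError(f"cannot truncate a {len(vector)}-D vector to {dim}-D")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for a text and an image embedding.
# They are random numbers, NOT output from any real model.
rng = random.Random(0)
text_vec = [rng.gauss(0.0, 1.0) for _ in range(3072)]
image_vec = [rng.gauss(0.0, 1.0) for _ in range(3072)]

for dim in RECOMMENDED_DIMS:
    t = truncate_embedding(text_vec, dim)
    i = truncate_embedding(image_vec, dim)
    print(dim, round(cosine_similarity(t, i), 4))
```

Shorter vectors reduce storage and search cost in a vector database, usually at some loss of retrieval quality. Because text and image vectors share one space, the same similarity function applies across modalities.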

Impact

Gemini Embedding 2 sets a new reference point for multimodal embeddings. By placing diverse media in one vector space, it simplifies RAG and search pipelines and enables richer real-world applications. It also strengthens Google's position against text-centric rivals such as OpenAI and is likely to accelerate adoption in enterprise analytics and multimodal AI development.
