Google Releases Gemini Embedding 2, Its First Natively Multimodal Embedding Model

Details

Google launched Gemini Embedding 2 in public preview via the Gemini API and Vertex AI. The model maps text, images, video, audio, and documents into a single embedding space and supports over 100 languages.
Built on the Gemini architecture, it succeeds Google's earlier text-only embedding models and integrates with tools such as LangChain and LlamaIndex and with vector databases such as Weaviate and Qdrant.
Per-request input limits are up to 8,192 text tokens, 6 images (PNG/JPEG), 120 seconds of video (MP4/MOV), and 6-page PDFs, with audio processed natively. Interleaved multimodal inputs such as image plus text are accepted. The model uses Matryoshka Representation Learning, so output vectors can be shortened to smaller dimensions (3072, 1536, and 768 are recommended).
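
To illustrate the flexible dimensions: with Matryoshka-style embeddings, a shorter vector is typically obtained by keeping the leading components of the full vector. The sketch below assumes that convention and rescales the result to unit length, a common step before cosine or dot-product comparison. The helper name `truncate_embedding` and the synthetic input vector are illustrative and are not part of any Google SDK; whether the API already returns normalized vectors at each size is not stated here.

```python
import math

# Output sizes named as recommended in the announcement.
RECOMMENDED_DIMS = (3072, 1536, 768)


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if dim <= 0 or dim > len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Synthetic stand-in for a 3072-dimensional model output (not real data).
full = [math.sin(i + 1) for i in range(3072)]
small = truncate_embedding(full, 768)
```

Smaller vectors reduce storage and search cost in a vector database, usually at some cost in retrieval quality, so the size is worth evaluating on the target workload.
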
Unlike its text-only predecessors, the model supports native multimodal retrieval. It is reported to outperform leading models on text, image, and video tasks, and it adds speech capabilities. Target tasks include RAG, semantic search, sentiment analysis, and clustering.
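
Because all media types share one vector space, retrieval across them reduces to a nearest-neighbor search over vectors. The following is a minimal sketch of that idea using cosine similarity. The file names and three-dimensional vectors are made up for illustration; real embeddings would come from the model and have hundreds or thousands of dimensions, and a vector database would replace the linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, corpus):
    """Return corpus items sorted from most to least similar to the query."""
    return sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)


# Hypothetical mixed-media corpus: (item name, toy embedding).
corpus = [
    ("photo.png", [0.9, 0.1, 0.0]),
    ("clip.mp4", [0.1, 0.9, 0.1]),
    ("report.pdf", [0.2, 0.2, 0.9]),
]

# Toy embedding standing in for a text query.
query = [0.8, 0.2, 0.3]
ranked = [name for name, _ in rank(query, corpus)]
```
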
Vertex AI documentation confirms support for custom task instructions and 3072-dimensional vectors. No direct competitor launches were verified in the last 90 days; the embedding models from OpenAI and Cohere remain text-focused and do not unify audio and video natively.

Impact

Gemini Embedding 2 raises the bar for multimodal embeddings: a single model and a single vector index can serve RAG and search across text, images, video, audio, and documents, which simplifies pipelines that previously needed a separate model per media type. It strengthens Google's position against text-centric rivals such as OpenAI and could accelerate adoption in enterprise analytics and multimodal AI development.