Details
- Google launched Gemini Embedding 2 in public preview through the Gemini API and Vertex AI. The model maps text, images, video, audio, and documents into a single embedding space and supports more than 100 languages.
- Built by Google on the Gemini architecture as the successor to its earlier text-only embedding models; integrates with frameworks such as LangChain and LlamaIndex and with vector databases such as Weaviate and Qdrant.
- Accepts up to 8,192 text tokens, 6 images (PNG/JPEG), 120 seconds of video (MP4/MOV), natively processed audio, and PDFs of up to 6 pages; handles interleaved multimodal inputs such as image plus text; uses Matryoshka Representation Learning so that output vectors can be shortened to smaller dimensions (3072, 1536, and 768 are recommended).
- Extends its text-only predecessors with native multimodal retrieval; it is reported to outperform leading models on text, image, and video tasks and adds speech capabilities. Target tasks include RAG, semantic search, sentiment analysis, and clustering.
- Vertex AI documentation confirms support for custom task instructions and 3072-dimensional vectors. No directly competing launch was verified in the last 90 days; the embedding models from OpenAI and Cohere remain text-focused and do not unify audio and video natively.
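
The flexible dimensions mentioned above rest on a property of Matryoshka-style embeddings: the leading components of a vector carry most of the signal, so a 3072-dimensional vector can be cut to 1536 or 768 components and rescaled to unit length before comparison. The sketch below illustrates that idea in plain Python with tiny made-up vectors. It does not call the Gemini API, and the helper names (`truncate_embedding`, `cosine_similarity`, `rank_by_similarity`) are hypothetical, chosen only for this example.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim <= 0 or dim > len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents, dim):
    """Rank documents (id -> vector) by similarity to the query.

    Both the query and every document are truncated to `dim` components
    first, so all comparisons happen in the same reduced space.
    """
    q = truncate_embedding(query, dim)
    scored = [
        (doc_id, cosine_similarity(q, truncate_embedding(vec, dim)))
        for doc_id, vec in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 4-dimensional vectors standing in for real embeddings (made-up data).
documents = {
    "doc_a": [1.0, 0.0, 0.0, 5.0],
    "doc_b": [0.0, 1.0, 0.0, 5.0],
}
query = [1.0, 0.1, 0.0, 0.0]
ranking = rank_by_similarity(query, documents, dim=2)
```

Rescaling after truncation matters because a shortened vector is generally no longer unit length. Whether a given reduced dimension keeps enough retrieval quality depends on the model and the data, so it is worth measuring on a representative query set.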
Impact
Gemini Embedding 2 sets a new benchmark for multimodal embeddings: one model can serve RAG and search pipelines across diverse media, which simplifies those pipelines and enables richer real-world applications. It strengthens Google's position against text-centric rivals such as OpenAI and is likely to accelerate adoption in enterprise analytics and multimodal AI development.