
How to Build Cross-Modal RAG Pipelines With Gemini Embedding 2

Learn how to process text, images, video, and audio into a single semantic vector space using Google's natively multimodal Gemini Embedding 2 model.

Google recently announced the general availability of Gemini Embedding 2, a unified model that maps text, images, video, audio, and documents into a single semantic vector space. The official Gemini Embedding 2 release highlights its use for agentic multi-step reasoning across heterogeneous data sources. You can use it to build cross-modal pipelines where an audio clip can query a technical manual, or a text prompt can retrieve specific video frames. Here is how to configure its inputs, manage vector dimensions, and integrate it into your retrieval systems.

Supported Input Modalities and Constraints

Gemini Embedding 2 processes interleaved multimodal inputs in a single request. This eliminates the need to run separate OCR or transcription models before generating vectors. The model natively ingests five distinct modalities, each with specific limits per request:

  • Text: Up to 8,192 tokens.
  • Images: Up to 6 images per request. Supported formats include PNG, JPEG, WebP, BMP, HEIC, HEIF, and AVIF.
  • Video: Up to 120 seconds of MP4 or MOV footage. The model automatically extracts the audio track from the video and interleaves it with the visual frames to create a holistic representation.
  • Audio: Up to 180 seconds. The model processes the raw audio natively without requiring intermediate text transcription.
  • Documents: Up to 6 pages of PDFs. Document processing includes built-in Optical Character Recognition (OCR).

When designing your system, enforce these limits at the application layer before making API calls. Requests exceeding these boundaries will fail rather than truncate automatically.
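As a minimal sketch of that application-layer check, the validator below hardcodes the limits from the list above (adjust them if Google revises the constraints) and rejects oversized payloads before any API call is made:

```python
# Pre-flight validation of multimodal inputs against the documented
# per-request limits, enforced before calling the embedding API.
# Limits are hardcoded from the constraints above; update if they change.

LIMITS = {
    "text_tokens": 8_192,
    "images": 6,
    "video_seconds": 120,
    "audio_seconds": 180,
    "pdf_pages": 6,
}

def validate_request(text_tokens=0, num_images=0, video_seconds=0,
                     audio_seconds=0, pdf_pages=0):
    """Raise ValueError locally instead of letting the API reject the call."""
    checks = [
        ("text_tokens", text_tokens),
        ("images", num_images),
        ("video_seconds", video_seconds),
        ("audio_seconds", audio_seconds),
        ("pdf_pages", pdf_pages),
    ]
    for name, value in checks:
        if value > LIMITS[name]:
            raise ValueError(f"{name}={value} exceeds limit of {LIMITS[name]}")

validate_request(text_tokens=4_000, num_images=2)   # passes silently
# validate_request(video_seconds=150)               # raises ValueError
```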

Configuring Vector Dimensions with MRL

By default, Gemini Embedding 2 generates a 3,072-dimensional float vector. While larger dimensions capture more nuanced relationships between modalities, they also increase storage costs and retrieval latency in your vector database.

The model supports Matryoshka Representation Learning (MRL), which allows you to truncate the output vectors to smaller dimensions with minimal accuracy loss. You can specify a lower output dimension, such as 1536 or 768, directly in the API request.

Truncating to 1536 dimensions cuts your storage footprint in half and significantly accelerates distance calculations during search. Use the full 3,072 dimensions when precision is the sole priority, but default to 1536 or 768 when optimizing for scale and latency. Evaluate retrieval quality, such as recall@k on a held-out query set, at each candidate dimension before finalizing your production size.
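The current google-genai Python SDK exposes an output_dimensionality option on its embed_content call; assuming Gemini Embedding 2 follows the same pattern, a truncated request might look like the sketch below. The model ID "gemini-embedding-2" is a placeholder, not a confirmed identifier:

```python
# Sketch using the google-genai SDK's embed_content call, which accepts
# an output_dimensionality option on current embedding models. The model
# ID below is a placeholder -- check the release notes for the real one.
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

result = client.models.embed_content(
    model="gemini-embedding-2",        # placeholder model ID (assumption)
    contents="Hydraulic pump maintenance schedule",
    config=types.EmbedContentConfig(output_dimensionality=1536),
)
vector = result.embeddings[0].values   # 1,536 floats instead of 3,072
```

Note that with some MRL models, truncated vectors are no longer unit-length; re-normalize them before cosine search if your vector database assumes normalized embeddings.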

Optimizing for Specific Tasks

Not all embeddings serve the same purpose. A vector optimized for clustering may perform poorly for semantic search. Gemini Embedding 2 lets you steer vector generation with the task_type parameter.

Specify the intended use case when making the request. Supported task types include:

  • search_query: Use this when embedding a user’s search string.
  • search_document: Use this when embedding the corpus data that will be stored in your database.
  • code_retrieval: Use this when the text payload contains source code rather than natural language.

Matching the task type to the operation ensures the model maps the data into the correct region of the vector space.
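As a sketch, assuming the SDK pattern shown earlier and that the task names above map directly to task_type values (an assumption until the final API docs confirm it), query-side and corpus-side embedding differ only in that one field:

```python
# Embedding the corpus and the query with matching task types.
# Model ID and task_type strings are assumptions based on this article,
# not confirmed API values.
from google import genai
from google.genai import types

client = genai.Client()

doc = client.models.embed_content(
    model="gemini-embedding-2",        # placeholder model ID
    contents="Torque the flange bolts to 45 Nm in a star pattern.",
    config=types.EmbedContentConfig(task_type="search_document"),
)

query = client.models.embed_content(
    model="gemini-embedding-2",
    contents="flange bolt torque spec",
    config=types.EmbedContentConfig(task_type="search_query"),
)
```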

API Integration and Framework Support

The model is accessible through both the Gemini API and Vertex AI. Because the exact payload structure varies depending on your chosen endpoint, refer to the official API reference for the complete request schema.

If you use an orchestration framework, Gemini Embedding 2 includes native Day 0 integration for LangChain, LlamaIndex, and Haystack. You can configure your framework’s embedding class to target the new model and pass your multimodal payloads directly through their respective document objects.
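As an illustration, LangChain’s existing GoogleGenerativeAIEmbeddings class already takes a model ID and a task type, so a plausible wiring for text payloads, again with a placeholder model ID, looks like this:

```python
# Hypothetical LangChain wiring: the existing GoogleGenerativeAIEmbeddings
# class accepts a model ID, so pointing it at the new model (placeholder
# ID below) should be the only change needed for text payloads.
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-2",   # placeholder model ID (assumption)
    task_type="search_document",         # assumed to pass through to the API
)

vectors = embeddings.embed_documents([
    "Pump seal replacement procedure",
    "Quarterly vibration analysis report",
])
```

Multimodal payloads would route through each framework’s own document abstractions, so check the framework docs for the exact interface.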

Performance Benchmarks and Pricing

Gemini Embedding 2 establishes new baselines for multimodal retrieval. It scored 68.9 overall on the Massive Multimodal Embedding Benchmark (MMEB), and 68.32 on the Massive Text Embedding Benchmark (MTEB) for English. For video retrieval, it averaged 68.8 across the VATEX, MSR-VTT, and YouCook2 benchmarks, outpacing alternatives like Amazon Nova 2 and Voyage Multimodal 3.5.

This capability comes with a higher operational cost. Pricing is approximately $0.20 per million tokens, roughly 10x the cost of leading text-only models like OpenAI’s text-embedding-3-small.

When planning your architecture, evaluate your exact workload. If your application exclusively searches text documents, standard text embedding models remain more cost-effective. Gemini Embedding 2 justifies its cost when you need true cross-modal Retrieval-Augmented Generation, such as allowing AI agents to query a database using images or raw audio diagnostics.

For your next step, configure a small test corpus containing text, images, and short audio clips, and generate MRL-truncated 768-dimension vectors to test cross-modal search latency in your current database.
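As a starting point for that latency test, here is a minimal brute-force baseline using NumPy with synthetic stand-in vectors; swap in your real 768-dimension embeddings and database client once the test corpus is generated:

```python
# Minimal latency baseline: brute-force cosine search over 768-dim vectors
# with NumPy. Useful as a floor before benchmarking your actual database.
# The corpus here is random stand-in data, not real embeddings.
import time
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 768)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # L2-normalize rows

query = corpus[42] + 0.1 * rng.standard_normal(768).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = corpus @ query                  # cosine similarity on unit vectors
top5 = np.argsort(scores)[-5:][::-1]     # indices of the 5 nearest items
elapsed_ms = (time.perf_counter() - start) * 1_000

print(f"top-5 ids: {top5}, search took {elapsed_ms:.2f} ms")
```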
