
IBM's Open Multilingual Embedding R2 Models Hit a 32K Context Window

IBM released Granite Embedding Multilingual R2, upgrading its Apache 2.0 encoder models with a 32,768-token context window and ModernBERT architecture.

On May 14, 2026, IBM Research launched the Granite Embedding Multilingual R2 family of open-source encoder models. Licensed under Apache 2.0, this release expands the context window to 32,768 tokens—a 64x increase over the 512-token limit of the R1 generation. The models are designed for enterprise-scale dense retrieval across more than 200 languages and include enhanced support for programming languages like Python, Java, Go, and SQL.

Architecture and Inference Footprint

The R2 generation shifts to the ModernBERT architecture. This foundation combines alternating global and local attention layers, Rotary Position Embeddings (RoPE), and Flash Attention 2 to keep long-sequence processing efficient. By handling larger document chunks natively, the architecture simplifies the preprocessing pipelines required for complex retrieval-augmented generation systems.
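The alternating-attention idea can be sketched as a simple layer schedule: most layers attend only within a local sliding window, while periodic layers attend globally. The specific window size and global-layer interval below are illustrative assumptions, not published Granite R2 hyperparameters.

```python
def attention_span(layer_idx: int, seq_len: int,
                   local_window: int = 128, global_every: int = 3) -> int:
    """Return how many tokens a layer attends over in an
    alternating global/local attention scheme (hypothetical values:
    every third layer is global, others use a 128-token window)."""
    if layer_idx % global_every == 0:
        return seq_len                      # global layer: full sequence
    return min(local_window, seq_len)       # local layer: sliding window
```

Because only a fraction of layers pay the quadratic cost of full attention, the overall compute grows far more gently with sequence length than in a classic BERT encoder.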

Despite the expanded 32K context window, the models maintain processing speeds suitable for high-volume indexing. The compact 97M variant processes approximately 2,900 documents per second on a single NVIDIA H100 GPU. Weights for both models are published alongside ONNX and OpenVINO formats to support flexible AI inference deployments across GPU clusters, CPUs, and edge hardware.
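The quoted throughput makes capacity planning straightforward. A back-of-envelope estimate, assuming the 2,900 docs/sec figure holds for your document mix and scales roughly linearly across GPUs:

```python
def indexing_time_hours(num_docs: int, docs_per_sec: float = 2900.0,
                        gpus: int = 1) -> float:
    """Estimate wall-clock hours to embed a corpus, using the quoted
    ~2,900 docs/sec per H100 for the 97M model (assumes linear GPU scaling)."""
    return num_docs / (docs_per_sec * gpus) / 3600.0
```

At that rate, a 10-million-document corpus indexes in roughly an hour on a single H100.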

Model Variants and Benchmark Results

The release splits into two primary bi-encoder models designed for different infrastructure constraints.

| Model | Parameters | Base Dimension | MTEB-v2 Retrieval Score |
|---|---|---|---|
| granite-embedding-311m-multilingual-r2 | 311M | 768 | 65.2 |
| granite-embedding-97m-multilingual-r2 | 97M | 384 | 60.3 |

The full-size 311M model utilizes Matryoshka Representation Learning (MRL). This allows developers to truncate the default 768-dimension embeddings down to 512, 384, 256, or 128 dimensions with minimal accuracy loss, giving engineering teams a dial to control vector database storage costs. The 65.2 score on the MTEB-v2 Retrieval benchmark places it in the top 3 of open multilingual models under 500M parameters.
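Mechanically, MRL truncation is just keeping the leading components of the embedding and re-normalizing so cosine similarity remains meaningful. A minimal sketch (the function name is ours, not an IBM API):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dim` components of a
    trained MRL embedding, then L2-renormalize to unit length so cosine
    similarity over the truncated vectors stays well-defined."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Dropping from 768 to 256 dimensions cuts vector storage by two-thirds; MRL training is what keeps the leading dimensions information-dense enough for this to cost little accuracy.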

The 97M compact variant achieves its footprint through aggressive layer pruning, reducing the original 22 layers to 12. IBM also compressed the vocabulary selection from 262K to 180K tokens before applying knowledge distillation. Its 60.3 retrieval score establishes a 9-point lead over competing open multilingual models in the sub-100M parameter class.

If your ingestion pipeline currently shreds large technical documents or mixed-language codebases into aggressive 512-token chunks, the 32K context window allows you to embed entire source files and architectural specs intact. Test the 97M model first for standard multilingual workflows; its knowledge-distilled architecture offers the optimal performance-to-cost ratio unless your specific dataset requires the higher-dimensional precision of the 311M variant.
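The migration decision above reduces to a per-document check: if the token count fits the 32K window, embed the document whole; otherwise fall back to overlapping chunks. A sketch of that routing logic, with hypothetical chunking parameters:

```python
def plan_embedding(token_count: int, max_context: int = 32768,
                   chunk_size: int = 512, overlap: int = 64) -> list[tuple[int, int]]:
    """Return (start, end) token spans to embed. Documents that fit the
    32K window get one span; longer ones fall back to overlapping
    chunks (chunk_size/overlap are illustrative, not IBM defaults)."""
    if token_count <= max_context:
        return [(0, token_count)]           # embed the whole document
    spans, start, step = [], 0, chunk_size - overlap
    while start < token_count:
        spans.append((start, min(start + chunk_size, token_count)))
        start += step
    return spans
```

With R1's 512-token limit, nearly every real document took the chunked path; with R2, most source files and specs take the single-span path, which removes chunk-boundary artifacts from retrieval.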
