
IBM's Open Multilingual Embedding R2 Models Hit a 32K Context Window

IBM released Granite Embedding Multilingual R2, upgrading its Apache 2.0 encoder models with a 32,768-token context window and ModernBERT architecture.

On May 14, 2026, IBM Research launched the Granite Embedding Multilingual R2 family of open-source encoder models. Licensed under Apache 2.0, this release expands the context window to 32,768 tokens—a 64x increase over the 512-token limit of the R1 generation. The models are designed for enterprise-scale dense retrieval across more than 200 languages and include enhanced support for programming languages like Python, Java, Go, and SQL.

Architecture and Inference Footprint

The R2 generation shifts to the ModernBERT architecture. This foundation combines alternating global and local attention layers, Rotary Position Embeddings (RoPE), and Flash Attention 2 to keep long-sequence processing efficient. By handling larger document chunks natively, the architecture simplifies the preprocessing pipelines required for complex retrieval-augmented generation systems.
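The alternating-attention idea can be sketched as a simple layer schedule: most layers attend only within a local sliding window, while periodic layers attend globally. The specific window size and global-layer interval below are illustrative assumptions, not published Granite R2 hyperparameters.

```python
def attention_span(layer_idx: int, seq_len: int,
                   local_window: int = 128, global_every: int = 3) -> int:
    """Return how many tokens a layer attends over in an
    alternating global/local attention scheme (hypothetical values:
    every third layer is global, others use a 128-token window)."""
    if layer_idx % global_every == 0:
        return seq_len                      # global layer: full sequence
    return min(local_window, seq_len)       # local layer: sliding window
```

Because only a fraction of layers pay the quadratic cost of full attention, the overall compute grows far more gently with sequence length than in a classic BERT encoder.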

Despite the expanded 32K context window, the models maintain processing speeds suitable for high-volume indexing. The compact 97M variant processes approximately 2,900 documents per second on a single NVIDIA H100 GPU. Weights for both models are published alongside ONNX and OpenVINO formats to support flexible AI inference deployments across GPU clusters, CPUs, and edge hardware.
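The quoted throughput makes capacity planning straightforward. A back-of-envelope estimate, assuming the 2,900 docs/sec figure holds for your document mix and scales roughly linearly across GPUs:

```python
def indexing_time_hours(num_docs: int, docs_per_sec: float = 2900.0,
                        gpus: int = 1) -> float:
    """Estimate wall-clock hours to embed a corpus, using the quoted
    ~2,900 docs/sec per H100 for the 97M model (assumes linear GPU scaling)."""
    return num_docs / (docs_per_sec * gpus) / 3600.0
```

At that rate, a 10-million-document corpus indexes in roughly an hour on a single H100.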

Model Variants and Benchmark Results

The release splits into two primary bi-encoder models designed for different infrastructure constraints.

| Model | Parameters | Base Dimension | MTEB-v2 Retrieval Score |
|---|---|---|---|
| granite-embedding-311m-multilingual-r2 | 311M | 768 | 65.2 |
| granite-embedding-97m-multilingual-r2 | 97M | 384 | 60.3 |

The full-size 311M model utilizes Matryoshka Representation Learning (MRL). This allows developers to truncate the default 768-dimension embeddings down to 512, 384, 256, or 128 dimensions with minimal accuracy loss, giving engineering teams a dial to control vector database storage costs. The 65.2 score on the MTEB-v2 Retrieval benchmark places it in the top 3 of open multilingual models under 500M parameters.
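Mechanically, MRL truncation is just keeping the leading components of the embedding and re-normalizing so cosine similarity remains meaningful. A minimal sketch (the function name is ours, not an IBM API):

```python
import math

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Matryoshka-style truncation: keep the first `dim` components of a
    trained MRL embedding, then L2-renormalize to unit length so cosine
    similarity over the truncated vectors stays well-defined."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Dropping from 768 to 256 dimensions cuts vector storage by two-thirds; MRL training is what keeps the leading dimensions information-dense enough for this to cost little accuracy.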

The 97M compact variant achieves its footprint through aggressive layer pruning, reducing the original 22 layers to 12. IBM also compressed the vocabulary selection from 262K to 180K tokens before applying knowledge distillation. Its 60.3 retrieval score establishes a 9-point lead over competing open multilingual models in the sub-100M parameter class.

If your ingestion pipeline currently shreds large technical documents or mixed-language codebases into aggressive 512-token chunks, the 32K context window allows you to embed entire source files and architectural specs intact. Test the 97M model first for standard multilingual workflows; its knowledge-distilled architecture offers the optimal performance-to-cost ratio unless your specific dataset requires the higher-dimensional precision of the 311M variant.
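The migration decision above reduces to a per-document check: if the token count fits the 32K window, embed the document whole; otherwise fall back to overlapping chunks. A sketch of that routing logic, with hypothetical chunking parameters:

```python
def plan_embedding(token_count: int, max_context: int = 32768,
                   chunk_size: int = 512, overlap: int = 64) -> list[tuple[int, int]]:
    """Return (start, end) token spans to embed. Documents that fit the
    32K window get one span; longer ones fall back to overlapping
    chunks (chunk_size/overlap are illustrative, not IBM defaults)."""
    if token_count <= max_context:
        return [(0, token_count)]           # embed the whole document
    spans, start, step = [], 0, chunk_size - overlap
    while start < token_count:
        spans.append((start, min(start + chunk_size, token_count)))
        start += step
    return spans
```

With R1's 512-token limit, nearly every real document took the chunked path; with R2, most source files and specs take the single-span path, which removes chunk-boundary artifacts from retrieval.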
