Train Multimodal Sentence Transformers for Visual Retrieval
Learn how to finetune multimodal embedding and reranker models for text, image, and audio using the updated Sentence Transformers library.
Hugging Face’s April 2026 release of Sentence Transformers v5.4 brings native multimodal training to the library. You can now train embedding and reranker models that process text, images, audio, and video through the standard Trainer API. The release introduces automatic modality detection, specialized loss functions for Vision-Language Models, and modular routing for custom architectures.
Installation and Modality Detection
The updated framework requires the `image` extra for vision tasks.

```bash
pip install -U "sentence-transformers[image]"
```
Version 5.4 introduces first-class multimodal support. The library handles PIL images, local file paths, remote URLs, and audio arrays natively. You pass these directly to the model exactly as you would pass text strings.
The core Transformer module reads processor configurations automatically and identifies the supported inputs for any loaded architecture. You can verify the capabilities of your current model through the `model.modalities` property or the `model.supports()` method. This automatic detection prevents runtime errors during mixed-batch processing.
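The detection logic can be pictured as a simple dispatch on input type. The sketch below is a hypothetical illustration only (the real library inspects the loaded processor’s configuration); `detect_modality` and the extension lists are not part of the Sentence Transformers API.

```python
# Simplified sketch of modality dispatch -- illustrative, not the library's API.
from typing import Any

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac")

def detect_modality(item: Any) -> str:
    """Guess the modality of a single input the way a dispatcher might."""
    if isinstance(item, str):
        lowered = item.lower()
        if lowered.startswith(("http://", "https://")) or lowered.endswith(IMAGE_EXTENSIONS):
            return "image"          # remote URL or local image path
        if lowered.endswith(AUDIO_EXTENSIONS):
            return "audio"          # local audio path
        return "text"               # plain string
    if hasattr(item, "size") and hasattr(item, "mode"):
        return "image"              # duck-typed PIL.Image
    if hasattr(item, "dtype"):
        return "audio"              # raw waveform array
    return "text"
```

In the actual library you do not call anything like this yourself; mixed batches are routed for you when you pass strings, paths, URLs, or arrays to the model.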
Multimodal Embedding Loss Functions
Training robust embeddings across different data types requires memory-efficient loss calculations. Vision-Language Models consume significant memory, making standard contrastive loss approaches difficult to scale. Version 5.4 provides three primary loss strategies tailored for multimodal datasets.
| Loss Function | Primary Mechanism | Best Use Case |
|---|---|---|
| CachedMultipleNegativesRankingLoss | Uses GradCache to separate forward and backward passes. | Scaling batch sizes for retrieval tasks without exhausting GPU memory. |
| MatryoshkaLoss | Forces accurate representations across truncated dimensions. | Producing flexible embeddings that scale from 1536 down to 64 dimensions. |
| Hardness-Weighted Contrastive Learning | Up-weights hard negative examples in the softmax calculation. | Improving edge-case retrieval inspired by the 2025 LLaVE architecture. |
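The Matryoshka idea in the table is easiest to see at inference time: embeddings trained this way stay usable after truncation, provided you re-normalize. A minimal numpy sketch (the dimension sizes are illustrative, and the random vector stands in for a real model output):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = emb[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a unit-norm 1536-dim embedding from a Matryoshka-trained model.
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 64)   # 24x smaller vector, still unit-norm
```

With MatryoshkaLoss, cosine similarities computed on the truncated vectors remain meaningful, which is what makes the 1536-to-64 scaling in the table practical.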
The CachedMultipleNegativesRankingLoss is critical for large-scale training. Contrastive learning relies on large batch sizes to provide sufficient negative examples. GradCache allows you to simulate these large batches effectively while operating within the physical constraints of your hardware.
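The underlying objective explains why batch size matters: with a batch of (query, positive) pairs, each query treats every other positive in the batch as a negative, so a batch of size B yields B−1 in-batch negatives per query. A numpy sketch of the uncached loss (GradCache computes the same quantity, but splits the forward and backward passes into chunks):

```python
import numpy as np

def mnr_loss(queries: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss: row i's positive is column i of the sim matrix."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (q @ p.T)                                  # (B, B) similarity matrix
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When queries and positives align perfectly the loss approaches zero; shrinking B weakens the signal because fewer negatives compete in each softmax, which is exactly the pressure GradCache relieves.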
Training Multimodal Rerankers
Rerankers evaluate the specific relevance of a query-document pair. The library now supports training multimodal Cross-Encoders using two distinct structural approaches.
The Any-to-Any plus LogitScore approach utilizes a Causal Language Model to generate an output token. The system calculates relevance by scoring the log-odds of a positive token against a negative token. A typical implementation scores the probability of the model outputting a “1” versus a “0” when evaluating the match between an image and a text query. The official documentation contains the full implementation details for configuring the logit scoring parameters.
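The scoring step itself reduces to comparing two logits from the model’s final token distribution. A hedged sketch, assuming the softmax-over-two-tokens formulation described above (the logit values here are made up; the real token ids and configuration come from the official documentation):

```python
import math

def logit_score(logit_yes: float, logit_no: float) -> float:
    """Relevance = P('1') renormalized over just the '1' and '0' tokens.

    softmax over two logits collapses to a sigmoid of their difference.
    """
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))

# A pair the model judges as a match gets a higher '1' logit:
score = logit_score(logit_yes=4.2, logit_no=-1.3)
```

Equal logits yield 0.5, and the score moves toward 1.0 as the “1” logit dominates, giving a calibrated-looking relevance value without any token generation beyond a single forward pass.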
The Feature Extraction plus Pooling plus Dense architecture provides a memory-efficient alternative. It projects pooled embeddings from the base model directly to a single relevance score, avoiding the computational overhead of generating tokens. You should evaluate this approach when deploying rerankers for high-throughput inference. Both architectures are supported by the new fully modular CrossEncoder framework.
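The pooling head is simple enough to sketch end to end: mean-pool the token embeddings, then apply one linear projection to a scalar. The numpy sketch below uses illustrative sizes (128 tokens, hidden size 768) and random stand-ins for real model outputs and learned weights:

```python
import numpy as np

def pooled_relevance(token_embs: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean-pool token embeddings, then project to a single relevance score."""
    pooled = token_embs.mean(axis=0)      # (hidden,) sequence representation
    return float(pooled @ w + b)          # one projection, no token generation

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 768))      # stand-in for base-model hidden states
w = rng.normal(size=768) * 0.01           # stand-in for a learned Dense layer
score = pooled_relevance(tokens, w, b=0.0)
```

Because the score comes from one matrix-vector product instead of an autoregressive decoding step, this head is the cheaper choice at serving time.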
Building Custom Architectures with the Router Module
You are not restricted to monolithic Vision-Language Models. The new Router module allows you to build custom multimodal models by composing separate, specialized encoders.
You can pair a lightweight text encoder like MiniLM with a dedicated vision encoder like SigLIP. The router sits in front of these sub-models and directs incoming batches to the appropriate encoder based on the detected modality. This modularity allows you to update the vision component without retraining the text representations. The Hugging Face repository provides comprehensive training scripts for assembling these composite architectures.
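The composition can be sketched as a router object that owns the two encoders and dispatches each item by modality. The classes below are stand-ins, not the library’s Router API; the lambdas merely label where MiniLM and SigLIP would sit:

```python
class ToyRouter:
    """Toy router: send each input to the encoder registered for its modality."""

    def __init__(self, encoders):
        self.encoders = encoders  # e.g. {"text": minilm, "image": siglip}

    def encode(self, items):
        # items: list of (modality, payload) pairs, pre-tagged for simplicity
        return [self.encoders[modality](payload) for modality, payload in items]

# Stand-in encoders; real ones would return embedding vectors.
router = ToyRouter({
    "text": lambda s: ("text-emb", s),     # placeholder for a MiniLM-style encoder
    "image": lambda p: ("image-emb", p),   # placeholder for a SigLIP-style encoder
})
embs = router.encode([("text", "quarterly revenue"), ("image", "report_p3.png")])
```

Swapping the `"image"` entry for a newer vision encoder leaves the text path untouched, which is the modularity benefit described above.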
The Visual Document Retrieval Baseline
The primary demonstration of the v5.4 capabilities is tomaarsen/Qwen3-VL-Embedding-2B-vdr. This model is explicitly finetuned for Visual Document Retrieval. This task involves matching raw text queries directly to document screenshots containing complex visual structures like charts and tables.
The finetuned version achieves an NDCG@10 of 0.947 on standard benchmarks, a substantial gain over the base model’s 0.888. This two-billion-parameter model outperforms existing Visual Document Retrieval models up to four times its size. Building a domain-specific embedding model with these new training pipelines lets you pursue similar efficiency gains on your proprietary data.
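NDCG@10, the metric quoted above, discounts each relevant hit by its rank and normalizes by the ideal ordering, so scores fall in [0, 1]. A minimal implementation for binary relevance labels:

```python
import math

def ndcg_at_k(relevances, k=10):
    """relevances: 0/1 relevance labels in the order the system ranked them."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

Placing the relevant document at rank 1 scores 1.0; pushing it to rank 2 drops the score to 1/log2(3) ≈ 0.63, which is why even small NDCG gains like 0.888 → 0.947 reflect meaningfully better top-of-list rankings.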
Memory Management and Tradeoffs
Processing dense visual inputs alongside text creates severe memory bottlenecks. Sentence Transformers v5.4 integrates Flash Attention 2 to manage this overhead.
The integration includes automatic input flattening. When the system detects a text-only input within a batch, it skips the standard padding operations to save memory. Mixing modalities haphazardly within a single forward pass forces the system to bypass these text-specific optimizations.
Relying on the router module with multiple large encoders increases the total parameter count loaded into memory. You must account for the combined footprint of both models when sizing your training cluster.
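A back-of-the-envelope sizing helps here: weight memory is parameters times bytes per parameter, summed over every encoder the router loads. The sketch below assumes bf16 (2-byte) weights and illustrative model sizes, and deliberately ignores activations, gradients, and optimizer state, which dominate during training:

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory in GiB (excludes activations, gradients, optimizer state)."""
    return num_params * bytes_per_param / 2**30

# Illustrative composite: a 2B vision encoder plus a 0.1B text encoder, both bf16.
combined = weight_gib(2e9) + weight_gib(0.1e9)   # roughly 3.9 GiB of weights alone
```

Switching the optimizer to Adam roughly triples the per-parameter footprint for trainable weights, so size the cluster from the combined figure, not from either encoder alone.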
Review the Any-to-Any multimodal reranking examples in the official repository to structure your initial training loops. Format your multimodal pairs using standard dataset arrays before initializing the Trainer API.