Train Multimodal Sentence Transformers for Visual Retrieval
Learn how to finetune multimodal embedding and reranker models for text, image, and audio using the updated Sentence Transformers library.
Hugging Face’s April 2026 release of Sentence Transformers v5.4 brings native multimodal training to the library. You can now train embedding and reranker models that process text, images, audio, and video through the standard Trainer API. The release introduces automatic modality detection, specialized loss functions for Vision-Language Models, and modular routing for custom architectures.
Installation and Modality Detection
The updated framework requires the `image` extra for vision tasks.

```bash
pip install -U "sentence-transformers[image]"
```
Version 5.4 introduces first-class multimodal support. The library handles PIL images, local file paths, remote URLs, and audio arrays natively. You pass these directly to the model exactly as you would pass text strings.
The core Transformer module reads processor configurations automatically and identifies the supported inputs for any loaded architecture. You can verify the capabilities of your current model through the `model.modalities` property or the `model.supports()` method. This automatic detection prevents runtime errors during mixed-batch processing.
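The detection logic can be pictured as a simple dispatch on input type. The sketch below is a hypothetical illustration only (the real library inspects the loaded processor’s configuration); `detect_modality` and the extension lists are not part of the Sentence Transformers API.

```python
# Simplified sketch of modality dispatch -- illustrative, not the library's API.
from typing import Any

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac")

def detect_modality(item: Any) -> str:
    """Guess the modality of a single input the way a dispatcher might."""
    if isinstance(item, str):
        lowered = item.lower()
        if lowered.startswith(("http://", "https://")) or lowered.endswith(IMAGE_EXTENSIONS):
            return "image"          # remote URL or local image path
        if lowered.endswith(AUDIO_EXTENSIONS):
            return "audio"          # local audio path
        return "text"               # plain string
    if hasattr(item, "size") and hasattr(item, "mode"):
        return "image"              # duck-typed PIL.Image
    if hasattr(item, "dtype"):
        return "audio"              # raw waveform array
    return "text"
```

In the actual library you do not call anything like this yourself; mixed batches are routed for you when you pass strings, paths, URLs, or arrays to the model.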
Multimodal Embedding Loss Functions
Training robust embeddings across different data types requires memory-efficient loss calculations. Vision-Language Models consume significant memory, making standard contrastive loss approaches difficult to scale. Version 5.4 provides three primary loss strategies tailored for multimodal datasets.
| Loss Function | Primary Mechanism | Best Use Case |
|---|---|---|
| CachedMultipleNegativesRankingLoss | Uses GradCache to separate forward and backward passes. | Scaling batch sizes for retrieval tasks without exhausting GPU memory. |
| MatryoshkaLoss | Forces accurate representations across truncated dimensions. | Producing flexible embeddings that scale from 1536 down to 64 dimensions. |
| Hardness-Weighted Contrastive Learning | Up-weights hard negative examples in the softmax calculation. | Improving edge-case retrieval inspired by the 2025 LLaVE architecture. |
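The Matryoshka idea in the table is easiest to see at inference time: embeddings trained this way stay usable after truncation, provided you re-normalize. A minimal numpy sketch (the dimension sizes are illustrative, and the random vector stands in for a real model output):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = emb[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a unit-norm 1536-dim embedding from a Matryoshka-trained model.
rng = np.random.default_rng(0)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 64)   # 24x smaller vector, still unit-norm
```

With MatryoshkaLoss, cosine similarities computed on the truncated vectors remain meaningful, which is what makes the 1536-to-64 scaling in the table practical.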
The CachedMultipleNegativesRankingLoss is critical for large-scale training. Contrastive learning relies on large batch sizes to provide sufficient negative examples. GradCache allows you to simulate these large batches effectively while operating within the physical constraints of your hardware.
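The underlying objective explains why batch size matters: with a batch of (query, positive) pairs, each query treats every other positive in the batch as a negative, so a batch of size B yields B−1 in-batch negatives per query. A numpy sketch of the uncached loss (GradCache computes the same quantity, but splits the forward and backward passes into chunks):

```python
import numpy as np

def mnr_loss(queries: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch-negatives ranking loss: row i's positive is column i of the sim matrix."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (q @ p.T)                                  # (B, B) similarity matrix
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When queries and positives align perfectly the loss approaches zero; shrinking B weakens the signal because fewer negatives compete in each softmax, which is exactly the pressure GradCache relieves.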
Training Multimodal Rerankers
Rerankers evaluate the specific relevance of a query-document pair. The library now supports training multimodal Cross-Encoders using two distinct structural approaches.
The Any-to-Any plus LogitScore approach utilizes a Causal Language Model to generate an output token. The system calculates relevance by scoring the log-odds of a positive token against a negative token. A typical implementation scores the probability of the model outputting a “1” versus a “0” when evaluating the match between an image and a text query. The official documentation contains the full implementation details for configuring the logit scoring parameters.
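The scoring step itself reduces to comparing two logits from the model’s final token distribution. A hedged sketch, assuming the softmax-over-two-tokens formulation described above (the logit values here are made up; the real token ids and configuration come from the official documentation):

```python
import math

def logit_score(logit_yes: float, logit_no: float) -> float:
    """Relevance = P('1') renormalized over just the '1' and '0' tokens.

    softmax over two logits collapses to a sigmoid of their difference.
    """
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))

# A pair the model judges as a match gets a higher '1' logit:
score = logit_score(logit_yes=4.2, logit_no=-1.3)
```

Equal logits yield 0.5, and the score moves toward 1.0 as the “1” logit dominates, giving a calibrated-looking relevance value without any token generation beyond a single forward pass.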
The Feature Extraction plus Pooling plus Dense architecture provides a memory-efficient alternative. It projects pooled embeddings from the base model directly to a single relevance score, avoiding the computational overhead of generating tokens. You should evaluate this approach when deploying rerankers for high-throughput inference. Both architectures are supported by the new fully modular CrossEncoder framework.
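The pooling head is simple enough to sketch end to end: mean-pool the token embeddings, then apply one linear projection to a scalar. The numpy sketch below uses illustrative sizes (128 tokens, hidden size 768) and random stand-ins for real model outputs and learned weights:

```python
import numpy as np

def pooled_relevance(token_embs: np.ndarray, w: np.ndarray, b: float) -> float:
    """Mean-pool token embeddings, then project to a single relevance score."""
    pooled = token_embs.mean(axis=0)      # (hidden,) sequence representation
    return float(pooled @ w + b)          # one projection, no token generation

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 768))      # stand-in for base-model hidden states
w = rng.normal(size=768) * 0.01           # stand-in for a learned Dense layer
score = pooled_relevance(tokens, w, b=0.0)
```

Because the score comes from one matrix-vector product instead of an autoregressive decoding step, this head is the cheaper choice at serving time.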
Building Custom Architectures with the Router Module
You are not restricted to monolithic Vision-Language Models. The new Router module allows you to build custom multimodal models by composing separate, specialized encoders.
You can pair a lightweight text encoder like MiniLM with a dedicated vision encoder like SigLIP. The router sits in front of these sub-models and directs incoming batches to the appropriate encoder based on the detected modality. This modularity allows you to update the vision component without retraining the text representations. The Hugging Face repository provides comprehensive training scripts for assembling these composite architectures.
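The composition can be sketched as a router object that owns the two encoders and dispatches each item by modality. The classes below are stand-ins, not the library’s Router API; the lambdas merely label where MiniLM and SigLIP would sit:

```python
class ToyRouter:
    """Toy router: send each input to the encoder registered for its modality."""

    def __init__(self, encoders):
        self.encoders = encoders  # e.g. {"text": minilm, "image": siglip}

    def encode(self, items):
        # items: list of (modality, payload) pairs, pre-tagged for simplicity
        return [self.encoders[modality](payload) for modality, payload in items]

# Stand-in encoders; real ones would return embedding vectors.
router = ToyRouter({
    "text": lambda s: ("text-emb", s),     # placeholder for a MiniLM-style encoder
    "image": lambda p: ("image-emb", p),   # placeholder for a SigLIP-style encoder
})
embs = router.encode([("text", "quarterly revenue"), ("image", "report_p3.png")])
```

Swapping the `"image"` entry for a newer vision encoder leaves the text path untouched, which is the modularity benefit described above.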
The Visual Document Retrieval Baseline
The primary demonstration of the v5.4 capabilities is tomaarsen/Qwen3-VL-Embedding-2B-vdr. This model is explicitly finetuned for Visual Document Retrieval. This task involves matching raw text queries directly to document screenshots containing complex visual structures like charts and tables.
The finetuned version achieves an NDCG@10 of 0.947 on standard benchmarks, a substantial gain over the base model’s 0.888. This two-billion-parameter model outperforms existing Visual Document Retrieval models up to four times its size. Building a domain-specific embedding model with these new training pipelines lets you pursue similar efficiency gains on your proprietary data.
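NDCG@10, the metric quoted above, discounts each relevant hit by its rank and normalizes by the ideal ordering, so scores fall in [0, 1]. A minimal implementation for binary relevance labels:

```python
import math

def ndcg_at_k(relevances, k=10):
    """relevances: 0/1 relevance labels in the order the system ranked them."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

Placing the relevant document at rank 1 scores 1.0; pushing it to rank 2 drops the score to 1/log2(3) ≈ 0.63, which is why even small NDCG gains like 0.888 → 0.947 reflect meaningfully better top-of-list rankings.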
Memory Management and Tradeoffs
Processing dense visual inputs alongside text creates severe memory bottlenecks. Sentence Transformers v5.4 integrates Flash Attention 2 to manage this overhead.
The integration includes automatic input flattening. When the system detects a text-only input within a batch, it skips the standard padding operations to save memory. Mixing modalities haphazardly within a single forward pass forces the system to bypass these text-specific optimizations.
Relying on the router module with multiple large encoders increases the total parameter count loaded into memory. You must account for the combined footprint of both models when sizing your training cluster.
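A back-of-the-envelope sizing helps here: weight memory is parameters times bytes per parameter, summed over every encoder the router loads. The sketch below assumes bf16 (2-byte) weights and illustrative model sizes, and deliberately ignores activations, gradients, and optimizer state, which dominate during training:

```python
def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory in GiB (excludes activations, gradients, optimizer state)."""
    return num_params * bytes_per_param / 2**30

# Illustrative composite: a 2B vision encoder plus a 0.1B text encoder, both bf16.
combined = weight_gib(2e9) + weight_gib(0.1e9)   # roughly 3.9 GiB of weights alone
```

Switching the optimizer to Adam roughly triples the per-parameter footprint for trainable weights, so size the cluster from the combined figure, not from either encoder alone.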
Review the Any-to-Any multimodal reranking examples in the official repository to structure your initial training loops. Format your multimodal pairs using standard dataset arrays before initializing the Trainer API.