Bounding Boxes Arrive in Mistral OCR 4 for Agentic Retrieval
Mistral AI's mistral-ocr-4-0 release transitions from flat text extraction to structured document mapping with bounding boxes and 170-language support.
Mistral AI released OCR 4 on June 23, 2026, shifting its document intelligence engine from simple text extraction to structured document understanding. The mistral-ocr-4-0 model introduces native paragraph-level bounding boxes and block classification, targeting the ingestion requirements of enterprise agentic search pipelines.
Structured Document Mapping
The primary technical shift in OCR 4 is the transition from a flat text stream to a structured document map. The model classifies content into 13 structural labels, including text, title, list, table, image, equation, and code. This block classification allows downstream models to process tabular data or code snippets with the correct formatting context.
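As a rough illustration of why block labels matter downstream, the sketch below wraps each block according to its label before it is passed to a language model. The `Block` shape, the field names, and the `render_for_llm` helper are assumptions made for this example; they are not taken from the Mistral response schema, which may differ.

```python
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence, built this way to keep the example readable


@dataclass
class Block:
    """Hypothetical block shape; the actual response schema may differ."""
    label: str    # e.g. "text", "title", "list", "table", "image", "equation", "code"
    content: str


def render_for_llm(block: Block) -> str:
    """Wrap a block so a downstream model sees its formatting context."""
    if block.label == "title":
        return "# " + block.content
    if block.label == "code":
        return f"{FENCE}\n{block.content}\n{FENCE}"
    if block.label == "equation":
        return f"$$ {block.content} $$"
    if block.label == "image":
        return f"[image: {block.content}]"
    # text, list, table, and any label not handled above pass through unchanged
    return block.content


blocks = [
    Block("title", "Quarterly Report"),
    Block("code", "print('hello')"),
    Block("text", "Revenue grew in Q2."),
]
prompt = "\n\n".join(render_for_llm(b) for b in blocks)
```

The dispatch is deliberately small: only labels that change how text should be read get special treatment, and everything else is forwarded as-is.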
Native paragraph-level bounding box extraction allows systems to localize text on the original document. This enables front-end applications to render in-context highlighting and exact visual citations when retrieving information. Mistral also added inline confidence scores at both the page and word levels. This granular scoring allows automated pipelines to flag low-confidence extractions for human verification.
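One way to use the page-level and word-level scores is a simple threshold check that routes weak pages to a reviewer. The sketch below assumes confidences are floats between 0 and 1 and uses hypothetical `Page` and `Word` types; the thresholds are placeholders to tune on your own documents, not values recommended by Mistral.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed range: 0.0 to 1.0


@dataclass
class Page:
    number: int
    confidence: float  # assumed range: 0.0 to 1.0
    words: list[Word]


def pages_needing_review(pages, page_threshold=0.90, word_threshold=0.60):
    """Return (page number, low-confidence words) for pages a human should check.

    A page is flagged if its own score is below page_threshold or if any
    word on it scores below word_threshold.
    """
    flagged = []
    for page in pages:
        weak_words = [w.text for w in page.words if w.confidence < word_threshold]
        if page.confidence < page_threshold or weak_words:
            flagged.append((page.number, weak_words))
    return flagged


pages = [
    Page(1, 0.98, [Word("Invoice", 0.99)]),
    Page(2, 0.95, [Word("T0tal", 0.41)]),
    Page(3, 0.70, [Word("Net", 0.90)]),
]
review_queue = pages_needing_review(pages)
```

Here page 2 is flagged because of a single weak word, and page 3 because its page-level score is low even though every word on it passes.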
The model processes images alongside standard formats like PDF, DOC, PPT, and OpenDocument. Multilingual capabilities cover 170 languages across 10 language groups, with specific performance improvements noted for low-resource languages.
Throughput and Benchmark Performance
Mistral reports that independent annotators preferred OCR 4 over competing systems with a 72% average win rate. The model processes up to 2,000 pages per minute when deployed on a single GPU. The table below lists the reported benchmark scores.
| Benchmark | Score |
|---|---|
| OmniDocBench | 93.07 |
| OlmOCRBench | 85.20 |
Deployment Channels and Integration
OCR 4 is available immediately through the Mistral API and Studio. Cloud deployments launched concurrently on Microsoft Foundry and Amazon SageMaker, with Snowflake support pending. Enterprise users requiring data residency can deploy the model as a self-hosted single container.
The model serves as the default ingestion component for the newly announced Mistral Search Toolkit, an open-source framework for composable search. Third-party adoption has already begun: the open-source platform Sparrow integrated OCR 4 as a cloud backend on launch day to convert documents into structured JSON. Mistral also announced that Mistral Medium 3.5, arriving June 24, is specifically tuned to reason over the structured data extracted by OCR 4. Developers who already deploy Mistral Small 4 for multimodal reasoning can adapt a similar architecture, pairing OCR extraction with a capable language model.
API Pricing Tiers
Pricing depends on the level of processing required. For asynchronous workloads, the Batch API discount halves the raw extraction cost, from $4.00 to $2.00 per 1,000 pages.
| Service Level | Cost per 1,000 Pages |
|---|---|
| Raw Extraction (Batch) | $2.00 |
| Raw Extraction (Standard) | $4.00 |
| Annotated Document AI | $5.00 |
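To make the per-1,000-page rates concrete, the following sketch estimates the cost of a job at each tier. The rates come from the table above; the tier keys and the `estimate_cost` helper are names invented for this example.

```python
# USD per 1,000 pages, copied from the pricing table above.
# The dictionary keys are illustrative names, not API identifiers.
PRICE_PER_1000_PAGES = {
    "raw_batch": 2.00,
    "raw_standard": 4.00,
    "annotated": 5.00,
}


def estimate_cost(pages: int, tier: str) -> float:
    """Estimated cost in USD for processing `pages` pages at the given tier."""
    return pages / 1000 * PRICE_PER_1000_PAGES[tier]


# Example: a 250,000-page archive at each service level
for tier in PRICE_PER_1000_PAGES:
    print(f"{tier}: ${estimate_cost(250_000, tier):,.2f}")
```

For a 250,000-page archive, the batch tier comes to $500, against $1,000 for standard raw extraction and $1,250 for annotated output.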
If you build a RAG application, updating your ingestion pipeline to capture bounding box coordinates changes how you present data to end users. Storing these coordinates alongside your text chunks allows your front-end to render a direct visual overlay on the source document instead of just quoting the extracted string.
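A minimal sketch of that idea follows. It assumes each bounding box is stored as normalized `(left, top, right, bottom)` fractions of the page, which may not match the coordinate convention the API actually returns; the `Chunk` type and `overlay_rect` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int
    # Assumed convention: (left, top, right, bottom) as fractions of the page size
    bbox: tuple[float, float, float, float]


def overlay_rect(chunk: Chunk, page_width_px: int, page_height_px: int) -> dict:
    """Convert a stored normalized box into pixel coordinates for a highlight."""
    x0, y0, x1, y1 = chunk.bbox
    return {
        "page": chunk.page,
        "left": round(x0 * page_width_px),
        "top": round(y0 * page_height_px),
        "width": round((x1 - x0) * page_width_px),
        "height": round((y1 - y0) * page_height_px),
    }


# The chunk is stored with its coordinates at ingestion time...
chunk = Chunk("Total due: $420", page=3, bbox=(0.10, 0.25, 0.60, 0.30))
# ...and turned into a highlight when the page is rendered at 1000 x 2000 px.
rect = overlay_rect(chunk, page_width_px=1000, page_height_px=2000)
```

Keeping the box normalized in storage means the same record works at any zoom level, since the front-end only scales it at render time.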