DharmaOCR 7B Proves Domain Alignment Beats Parameter Scaling
Dharma-AI has released two specialized OCR models and argues that a training history aligned with the task lets small models outperform general-purpose frontier models on structured extraction.
Dharma-AI’s latest technical research argues that domain alignment is a stronger determinant of task performance than raw parameter count. In a publication on the Hugging Face blog, the company details how aligning a model’s training history closely with its deployment task allows smaller architectures to outperform general-purpose frontier models.
To support this thesis, the company released two specialized small language models designed for structured document extraction: DharmaOCR Full (7 billion parameters) and DharmaOCR Lite (3 billion parameters).
Extraction Benchmarks
The models were evaluated against the DharmaOCR-Benchmark, a demanding dataset focused on document extraction in Brazilian Portuguese. The dataset includes public records like ESTER-Pt alongside dense internal legal and administrative documents.
Standard accuracy metrics for document parsing often mask structural failures, so Dharma-AI recorded both overall extraction accuracy and an explicit degeneration rate for each model.
| Model | Parameters | Benchmark Score | Degeneration Rate |
|---|---|---|---|
| DharmaOCR Full | 7B | 0.925 | 0.40% |
| DharmaOCR Lite | 3B | 0.911 | 0.20% |
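The article does not say how degeneration is detected. A minimal sketch, assuming degeneration shows up as a repeated word n-gram loop, could compute a degeneration rate like this. The function names and the thresholds `n` and `max_repeats` are hypothetical, and legitimately repetitive documents (such as tables) would need a more careful rule.

```python
from collections import Counter


def is_degenerate(text: str, n: int = 4, max_repeats: int = 5) -> bool:
    """Flag text in which any word n-gram occurs more than max_repeats times.

    Assumption: repetition loops are the failure mode of interest; the
    source does not define its own detection rule.
    """
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) > max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Return the fraction of outputs flagged as degenerate."""
    if not outputs:
        return 0.0
    return sum(is_degenerate(o) for o in outputs) / len(outputs)


if __name__ == "__main__":
    clean = "Art. 1 This law establishes the rules for public records."
    looping = " ".join(["total"] * 20)  # simulated repetition loop
    print(degeneration_rate([clean, looping]))
```

Reporting this rate next to the accuracy score makes a model that is usually right but occasionally loops visible in the results.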
Direct Preference Optimization for Structural Stability
Standard decoders frequently suffer from structural text degeneration during dense extraction tasks. To fix this, Dharma-AI implemented a two-stage training pipeline. The models underwent initial Supervised Fine-Tuning followed by Direct Preference Optimization (DPO).
The application of DPO directly targeted the structural failures that standard benchmarks overlook. This secondary optimization phase reduced the overall degeneration rate by 87.6% relative to similar model families lacking this specialized alignment.
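For readers unfamiliar with DPO, the standard per-pair loss can be sketched in a few lines. This is the general DPO objective, not Dharma-AI's training code. The sketch assumes the "chosen" response is a clean transcription and the "rejected" one is a degenerate output, which the article does not confirm, and `beta=0.1` is an illustrative value.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model (typically the SFT checkpoint).
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy prefers the chosen output more than the reference does: lower loss.
    print(dpo_loss(-5.0, -20.0, -10.0, -10.0))
```

The loss falls as the policy raises the probability of the chosen output and lowers that of the rejected one, relative to the reference model.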
AWS Infrastructure and Context Limits
Dharma-AI standardized its research and deployment infrastructure on AWS g6e.2xlarge instances. This instance size provides a single NVIDIA L40S GPU with 48 GB of GDDR6 memory, paired with 8 AMD EPYC vCPUs and 64 GiB of RAM.
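A rough back-of-the-envelope check shows why models of this size fit comfortably in the 48 GB of GPU memory cited above. The sketch assumes 16-bit weights (2 bytes per parameter), which the article does not state, and ignores activations and runtime overhead.

```python
GPU_MEMORY_GB = 48  # L40S memory, as cited in the article


def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Assumption: 16-bit weights by default; the source does not give
    the serving precision.
    """
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for name, params in [("DharmaOCR Full", 7), ("DharmaOCR Lite", 3)]:
        weights = weight_memory_gb(params)
        headroom = GPU_MEMORY_GB - weights
        print(f"{name}: ~{weights:.0f} GB weights, ~{headroom:.0f} GB left for KV cache and overhead")
```

Under these assumptions, most of the card remains available for the KV cache, which is what bounds how many pages can be processed concurrently.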
To process dense document pages without truncation, inference is served via vLLM with a strict context limit of 8,192 tokens. For enterprise environments requiring strict data residency, including GDPR and LGPD compliance, the models are available through the AWS Marketplace as “Sovereign & Confidential” deployments.
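In vLLM, the context limit covers the prompt and the generated tokens together, so a dense page leaves less room for output. A minimal sketch of the budget check a client might run before sending a request follows. The helper name is hypothetical, and the prompt token count is assumed to come from the model's own tokenizer (including any image tokens, if the model consumes page images).

```python
MAX_MODEL_LEN = 8192  # context limit cited in the article


def max_new_tokens(prompt_tokens: int, requested: int,
                   limit: int = MAX_MODEL_LEN) -> int:
    """Clamp the generation budget so prompt + output stays within the limit.

    Raises ValueError when the prompt alone fills the context window,
    since no output tokens could then be generated.
    """
    if prompt_tokens >= limit:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens; limit is {limit}"
        )
    return min(requested, limit - prompt_tokens)


if __name__ == "__main__":
    print(max_new_tokens(prompt_tokens=2000, requested=4096))
    print(max_new_tokens(prompt_tokens=6000, requested=4096))
```

Clamping on the client side turns a silent truncation into an explicit decision, for example splitting the page or reducing the requested output length.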
The Shift to Portfolio Procurement
Dharma-AI positions these results within a broader industry movement toward “Portfolio-based AI” procurement. Rather than signing a single comprehensive contract with providers of frontier models like GPT-5.5 or Claude 4.7, organizations are increasingly routing specific workloads to the most efficient specialized tool.
This pattern mirrors other recent shifts in which routing workloads to specialized models reduces LLM API costs in production. The report cites Cursor Composer 2.5 as a parallel example, noting that the specialized coding model matched GPT-5.5’s performance on standard agent benchmarks at approximately one-tenth of the operating cost.
If you build document extraction pipelines, replacing large generalized models with tightly aligned, domain-specific small language models can lower inference latency and stabilize structured output formats.