DharmaOCR 7B Proves Domain Alignment Beats Parameter Scaling

Dharma-AI’s latest technical research argues that domain alignment is a stronger determinant of task performance than raw parameter count. In a publication on the Hugging Face blog, the company details how aligning a model’s training history closely with its deployment task allows smaller architectures to outperform general-purpose frontier models.

To support this thesis, the company released two specialized small language models built for structured document extraction: DharmaOCR Full (7 billion parameters) and DharmaOCR Lite (3 billion parameters).

Extraction Benchmarks

The models were evaluated against the DharmaOCR-Benchmark, a demanding dataset focused on document extraction in Brazilian Portuguese. The dataset includes public records like ESTER-Pt alongside dense internal legal and administrative documents.

Standard accuracy metrics for document parsing often mask structural failures, so Dharma-AI recorded two numbers for each model: overall extraction accuracy and an explicit degeneration rate.

Model	Parameters	Benchmark Score	Degeneration Rate
DharmaOCR Full	7B	0.925	0.40%
DharmaOCR Lite	3B	0.911	0.20%
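The article does not say how Dharma-AI detects a degenerate output, so the following is only a minimal sketch of one way to measure such a rate. It assumes that "degeneration" means the decoder getting stuck repeating the same phrase, and it flags an output when any word 4-gram repeats more than five times. The function names, the n-gram size, and the threshold are all illustrative assumptions, not the benchmark's actual method.

```python
from collections import Counter


def is_degenerate(text: str, n: int = 4, max_repeats: int = 5) -> bool:
    """Flag text whose most frequent word n-gram occurs more than max_repeats times.

    Assumption: repetition loops are the failure mode of interest.
    The values of n and max_repeats are illustrative, not from the source.
    """
    words = text.split()
    if len(words) < n:
        return False
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(ngrams.values()) > max_repeats


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as degenerate (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_degenerate(o) for o in outputs) / len(outputs)
```

A rate computed this way is reported separately from accuracy because a page can score well on character overlap and still end in a repetition loop that breaks downstream parsing.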

Direct Preference Optimization for Structural Stability

Standard decoders frequently suffer from structural text degeneration during dense extraction tasks. To address this, Dharma-AI implemented a two-stage training pipeline: Supervised Fine-Tuning (SFT) first, followed by Direct Preference Optimization (DPO).

The application of DPO directly targeted the structural failures that standard benchmarks overlook. This secondary optimization phase reduced the overall degeneration rate by 87.6% relative to similar model families lacking this specialized alignment.
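For readers unfamiliar with DPO, the standard per-pair loss can be sketched in a few lines. The article does not describe how Dharma-AI built its preference pairs or which hyperparameters it used, so treat everything here as an assumption: the pairing of a clean transcription ("chosen") against a degenerate one ("rejected") is hypothetical, and `beta=0.1` is simply a common default. The inputs are summed token log-probabilities of each response under the policy being trained and under the frozen reference model (typically the SFT checkpoint).

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    beta is an assumed value; the source does not state it.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is how preference training can push a decoder away from looping outputs without a separate reward model.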

AWS Infrastructure and Context Limits

Dharma-AI standardized its research and deployment infrastructure on AWS g6e.2xlarge instances. This instance size provides a single NVIDIA L40S GPU with 48 GB of GDDR6 memory, paired with 8 AMD EPYC vCPUs and 64 GiB of RAM.

Inference is served through vLLM with a fixed context limit of 8,192 tokens, sized to process dense document pages without truncation. For enterprise environments requiring strict data residency, including GDPR and LGPD compliance, the models are available through the AWS Marketplace as “Sovereign & Confidential” deployments.
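In vLLM the context limit (the `max_model_len` engine argument) covers the prompt and the generated tokens together, so a pipeline has to budget output length per page. The helper below is a minimal sketch of that arithmetic in plain Python; the function name and the `reserve` parameter are hypothetical, and only the 8,192 figure comes from the article.

```python
MAX_MODEL_LEN = 8192  # context limit stated in the article


def max_new_tokens(prompt_tokens: int,
                   reserve: int = 0,
                   max_model_len: int = MAX_MODEL_LEN) -> int:
    """Return how many tokens remain for generation within the context window.

    prompt_tokens should count everything sent to the model for the page.
    reserve is an optional safety margin (hypothetical parameter).
    Raises ValueError when the prompt alone leaves no room to generate.
    """
    remaining = max_model_len - prompt_tokens - reserve
    if remaining <= 0:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens leaves no room within {max_model_len}"
        )
    return remaining
```

For example, a page whose prompt consumes 3,000 tokens leaves 5,192 tokens for the extracted text; a page that exceeds the window should be split or downscaled before it reaches the server rather than silently truncated.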

The Shift to Portfolio Procurement

Dharma-AI positions these results within a broader industry movement toward “Portfolio-based AI” procurement. Rather than signing a single comprehensive contract with providers of frontier models like GPT-5.5 or Claude 4.7, organizations are increasingly routing specific workloads to the most efficient specialized tool.

This pattern mirrors other recent industry shifts where specialized routing helps reduce LLM API costs in production. The report cites Cursor Composer 2.5 as a parallel example, noting that the specialized coding model matched GPT-5.5’s performance on standard agent benchmarks at approximately one-tenth of the operating cost.
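In code, portfolio routing can be as simple as a lookup from workload type to the model registered for it, with a general-purpose fallback. The sketch below is illustrative only: the workload labels and model identifiers are hypothetical placeholders, not real endpoint names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical model identifier
    reason: str  # why this workload goes to this model


# Hypothetical routing table: specialized models for the workloads they cover.
ROUTES = {
    "document_extraction": Route("dharmaocr-full", "specialized extraction model"),
    "code_agent": Route("specialized-coding-model", "specialized coding model"),
}

# Anything without a specialized model falls back to a general-purpose one.
DEFAULT_ROUTE = Route("general-frontier-model", "no specialized model registered")


def route(workload: str) -> Route:
    """Pick the registered specialized model for a workload, else the fallback."""
    return ROUTES.get(workload, DEFAULT_ROUTE)
```

Keeping the table explicit makes the procurement decision auditable: each workload names the model that serves it and why.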

For teams building document extraction pipelines, these results suggest that replacing large general-purpose models with tightly aligned, domain-specific small language models can lower inference latency while stabilizing structured output formats.
