
DharmaOCR 7B Proves Domain Alignment Beats Parameter Scaling

Dharma-AI has released two specialized OCR models, arguing that a training history targeted at the deployment task lets small models outperform general-purpose frontier models on structured tasks.

Dharma-AI’s latest technical research argues that domain alignment is a stronger determinant of task performance than raw parameter count. In a publication on the Hugging Face blog, the company details how aligning a model’s training history closely with its deployment task allows smaller architectures to outperform general-purpose frontier models.

To support this thesis, the company released two specialized small language models built for structured document extraction: DharmaOCR Full (7 billion parameters) and DharmaOCR Lite (3 billion parameters).

Extraction Benchmarks

The models were evaluated against the DharmaOCR-Benchmark, a demanding dataset focused on document extraction in Brazilian Portuguese. The dataset includes public records like ESTER-Pt alongside dense internal legal and administrative documents.

Standard accuracy metrics for document parsing often mask structural failures, so Dharma-AI recorded both overall extraction accuracy and an explicit degeneration rate for each model.

Model            Parameters   Benchmark Score   Degeneration Rate
DharmaOCR Full   7B           0.925             0.40%
DharmaOCR Lite   3B           0.911             0.20%
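
The article does not say how degeneration is detected. As a rough illustration only, the sketch below assumes that a degenerate output is one stuck in a repetition loop and reports the share of flagged outputs as a percentage. The function names and the thresholds `max_period` and `min_repeats` are hypothetical and are not Dharma-AI's metric.

```python
def is_degenerate(text: str, max_period: int = 8, min_repeats: int = 6) -> bool:
    """Heuristic: flag text in which a short word sequence repeats back-to-back.

    Assumption: "degeneration" means a repetition loop. The article does not
    give Dharma-AI's definition, and both thresholds are illustrative.
    """
    words = text.split()
    for period in range(1, max_period + 1):
        for start in range(len(words) - period + 1):
            unit = words[start:start + period]
            repeats = 1
            pos = start + period
            # Count how many times the same unit directly follows itself.
            while words[pos:pos + period] == unit:
                repeats += 1
                pos += period
            if repeats >= min_repeats:
                return True
    return False


def degeneration_rate(outputs: list[str]) -> float:
    """Percentage of outputs flagged as degenerate."""
    if not outputs:
        return 0.0
    flagged = sum(is_degenerate(o) for o in outputs)
    return 100.0 * flagged / len(outputs)
```

A rate like this is tracked separately from the accuracy score because a single looping page can still score well on average while being unusable downstream.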

Direct Preference Optimization for Structural Stability

Standard decoders frequently suffer from structural text degeneration during dense extraction tasks. To fix this, Dharma-AI implemented a two-stage training pipeline. The models underwent initial Supervised Fine-Tuning followed by Direct Preference Optimization (DPO).

The application of DPO directly targeted the structural failures that standard benchmarks overlook. This secondary optimization phase reduced the overall degeneration rate by 87.6% relative to similar model families lacking this specialized alignment.
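
For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes the "chosen" output is a clean transcription and the "rejected" output is a degenerate one. The article does not describe how Dharma-AI built its pairs, and the `beta` value is illustrative.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained and
    under the frozen reference model (here, the SFT checkpoint).
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals log(1 + exp(-logits)); for very negative
    # logits the value is approximately -logits, which avoids overflow.
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

The loss falls below log 2 once the policy prefers the clean output over the degenerate one more strongly than the reference model does, which is the pressure that discourages degenerate generations.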

AWS Infrastructure and Context Limits

Dharma-AI standardized its research and deployment infrastructure on AWS g6e.2xlarge instances. Each instance provides a single NVIDIA L40S GPU with 48 GB of GDDR6 memory, paired with 8 AMD EPYC vCPUs and 64 GiB of RAM.

Inference is served through vLLM with a strict context limit of 8,192 tokens, a budget intended to cover dense document pages without truncation. For enterprise environments requiring strict data residency, including GDPR and LGPD compliance, the models are available through the AWS Marketplace as “Sovereign & Confidential” deployments.
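
In vLLM the context limit is set through the `max_model_len` engine argument, and it covers the prompt and the generated text together. The sketch below is a hypothetical pre-flight check, assuming token counts have already been obtained from the model's tokenizer. The helper names are illustrative and are not part of vLLM.

```python
CONTEXT_LIMIT = 8192  # context limit stated in the article


def max_output_tokens(prompt_tokens: int, context_limit: int = CONTEXT_LIMIT) -> int:
    """Tokens left for the transcription once the prompt is accounted for."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_limit - prompt_tokens, 0)


def fits_in_context(prompt_tokens: int, expected_output_tokens: int,
                    context_limit: int = CONTEXT_LIMIT) -> bool:
    """True when the prompt and the expected output together stay within the limit."""
    return prompt_tokens + expected_output_tokens <= context_limit
```

A page that fails this check would need to be split or downscaled before being sent, otherwise its transcription risks being cut off.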

The Shift to Portfolio Procurement

Dharma-AI positions these results within a broader industry movement toward “Portfolio-based AI” procurement. Rather than signing a single comprehensive contract with providers of frontier models like GPT-5.5 or Claude 4.7, organizations are increasingly routing specific workloads to the most efficient specialized tool.

This pattern mirrors other recent industry shifts where specialized routing helps reduce LLM API costs in production. The report cites Cursor Composer 2.5 as a parallel example, noting that the specialized coding model matched GPT-5.5’s performance on standard agent benchmarks at approximately one-tenth of the operating cost.
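
In code, portfolio-style routing can be as simple as a lookup from workload type to model. The sketch below is illustrative only: the workload labels and deployment names are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str         # hypothetical deployment name
    specialized: bool  # True when the model is a domain-specific specialist


# Hypothetical routing table: workload type -> preferred model.
ROUTES = {
    "document_extraction": Route("dharmaocr-full", specialized=True),
    "code_generation": Route("specialized-coding-model", specialized=True),
}
FALLBACK = Route("general-frontier-model", specialized=False)


def pick_model(workload: str) -> Route:
    """Send known workloads to a specialist and everything else to the generalist."""
    return ROUTES.get(workload, FALLBACK)
```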

If you build document extraction pipelines, replacing large general-purpose models with tightly aligned, domain-specific small language models can lower inference latency and stabilize structured output formats.
