AI Engineering · 3 min read

New QIMMA Leaderboard Ranks Top Arabic AI Models

TII's QIMMA leaderboard introduces a quality-first validation pipeline and native Arabic code evaluation to redefine Large Language Model benchmarking.

The Technology Innovation Institute (TII) has launched QIMMA, an Arabic LLM leaderboard that filters out translation artifacts and encoding errors before ranking models. If you build Arabic-language applications, this shifts how you select foundation models. Previous benchmarks often relied on direct English translations, which introduced cultural misalignments and flawed gold-standard answers. QIMMA addresses this by evaluating models purely on validated, native Arabic datasets.

Validation Before Evaluation

The leaderboard relies on a strict filtering mechanism: every sample in the dataset passes review by two different LLMs and by human experts. This validation pipeline flags incorrect gold answers and cultural biases, and it discarded a large share of samples drawn from existing, widely used benchmarks.
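The gating logic described above can be sketched as a filter that keeps a sample only if every reviewer approves it. This is a minimal illustration; the reviewer functions, `Sample` fields, and approval rule are assumptions, not QIMMA's actual pipeline or API.

```python
# Hypothetical sketch of a QIMMA-style validation gate: a sample survives
# only if every reviewer (two LLMs, then human experts) approves it.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    question: str
    gold_answer: str

Reviewer = Callable[[Sample], bool]  # True means the sample passes review

def validate(samples: list[Sample], reviewers: list[Reviewer]) -> list[Sample]:
    """Discard any sample flagged by at least one reviewer."""
    return [s for s in samples if all(check(s) for check in reviewers)]

# A trivial automated check standing in for LLM/human review.
not_empty: Reviewer = lambda s: bool(s.gold_answer.strip())
samples = [Sample("ما عاصمة مصر؟", "القاهرة"), Sample("سؤال معيب", "")]
kept = validate(samples, [not_empty])
print(len(kept))  # only the sample with a valid gold answer survives
```

In a real pipeline each `Reviewer` would wrap an LLM call or a human annotation queue; the unanimous-approval rule is one plausible way to combine them.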

Relying on unfiltered evaluation sets leads to inaccurate model rankings. By removing flawed translations, QIMMA provides a cleaner signal of true linguistic comprehension. This methodology changes how you evaluate and test AI agents operating in Arabic, moving the standard away from simply aggregating legacy benchmarks.

Benchmark Scope and Domains

The evaluation suite contains over 52,000 samples across 109 subsets. More than 99% of the data originates in native Arabic. The domains span STEM, Legal, Medical, and Trust & Safety. Specialized medical evaluation utilizes benchmarks like MedArabiQ and MedAraBench, while trust metrics rely on AraTrust.

QIMMA introduces the first Arabic code evaluation via 3LM, which adapts HumanEval+ and MBPP+. The system runs on transparent infrastructure, utilizing LightEval and EvalPlus.
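The "plus" in HumanEval+ and MBPP+ refers to scoring a completion against extended edge-case tests in addition to the original ones. The following self-contained sketch illustrates that idea only; the task, tests, and completion are invented examples, not 3LM or EvalPlus internals.

```python
# Minimal sketch of EvalPlus-style scoring: a completion passes only if it
# satisfies both the original ("base") tests and the extended ("plus") tests.
# The task and tests below are illustrative, not drawn from 3LM.

def passes(code: str, tests: list[tuple[tuple, object]]) -> bool:
    """Execute a candidate solution and check it against all test cases."""
    namespace: dict = {}
    exec(code, namespace)  # real harnesses run this step in a sandbox
    fn = namespace["add"]
    return all(fn(*args) == expected for args, expected in tests)

base_tests = [((1, 2), 3)]
plus_tests = [((0, 0), 0), ((-1, 1), 0)]  # extra edge cases, "plus" style

completion = "def add(a, b):\n    return a + b\n"
print(passes(completion, base_tests) and passes(completion, plus_tests))  # True
```

The stricter plus-test pass rate is what separates solutions that merely fit the original tests from ones that generalize, which is why QIMMA adopts this harness for code.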

Model Performance and Architecture

Initial rankings demonstrate the advantage of native training over scale. TII’s Falcon-H1-34B model, built with a Hybrid-Head architecture, matches or exceeds the performance of much larger generalist models on Arabic-specific tasks.

Model        Parameters   Performance Context
Falcon-H1    34B          Top performance on native Arabic tasks
Qwen 2.5     72B          Outperformed or matched by Falcon-H1
Llama 3.3    70B          Outperformed or matched by Falcon-H1

To support reproducibility, TII provides per-sample inference outputs for all evaluated models. You can audit exact success and failure modes on individual questions. This level of transparency is critical when you monitor AI applications for production readiness across diverse linguistic subsets.
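Auditing per-sample outputs usually means aggregating per-question correctness into per-subset scores. Here is a minimal sketch of that aggregation; the JSONL field names (`subset`, `correct`) are assumptions for illustration, and QIMMA's published schema may differ.

```python
# Hypothetical audit of per-sample inference outputs: roll individual
# correctness records up into per-subset accuracy. Field names are assumed.
import json
from collections import defaultdict

def subset_accuracy(lines: list[str]) -> dict[str, float]:
    """Aggregate per-sample correctness into accuracy per benchmark subset."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        totals[rec["subset"]] += 1
        hits[rec["subset"]] += int(rec["correct"])
    return {s: hits[s] / totals[s] for s in totals}

records = [
    '{"subset": "Medical", "correct": true}',
    '{"subset": "Medical", "correct": false}',
    '{"subset": "Legal", "correct": true}',
]
print(subset_accuracy(records))  # {'Medical': 0.5, 'Legal': 1.0}
```

Breaking scores out by subset like this is what lets you spot a model that looks strong in aggregate but fails on, say, a specific legal or medical slice.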

When selecting an Arabic LLM for production, test your candidate models against the per-sample outputs provided in the QIMMA dataset. Examining exactly where larger models fail on native Arabic nuances will dictate whether you deploy a localized 34B model or absorb the inference costs of a 70B generalist.
