AI Engineering · 3 min read

New QIMMA Leaderboard Ranks Top Arabic AI Models

TII's QIMMA leaderboard introduces a quality-first validation pipeline and native Arabic code evaluation to redefine Large Language Model benchmarking.

The Technology Innovation Institute (TII) has launched QIMMA, an Arabic LLM leaderboard that filters out translation artifacts and encoding errors before ranking models. If you build Arabic-language applications, this shifts how you select foundation models. Previous benchmarks often relied on direct English translations, which introduced cultural misalignments and flawed gold-standard answers. QIMMA addresses this by evaluating models purely on validated, native Arabic datasets.

Validation Before Evaluation

The leaderboard relies on a strict filtering mechanism: every sample in the dataset passes review by two different LLMs and by human experts. This validation pipeline flags incorrect gold answers and cultural biases, and it discarded a large share of samples drawn from existing, widely used benchmarks.
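The gating logic described above can be sketched as a filter that keeps a sample only if every reviewer approves it. This is a minimal illustration; the reviewer functions, `Sample` fields, and approval rule are assumptions, not QIMMA's actual pipeline or API.

```python
# Hypothetical sketch of a QIMMA-style validation gate: a sample survives
# only if every reviewer (two LLMs, then human experts) approves it.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    question: str
    gold_answer: str

Reviewer = Callable[[Sample], bool]  # True means the sample passes review

def validate(samples: list[Sample], reviewers: list[Reviewer]) -> list[Sample]:
    """Discard any sample flagged by at least one reviewer."""
    return [s for s in samples if all(check(s) for check in reviewers)]

# A trivial automated check standing in for LLM/human review.
not_empty: Reviewer = lambda s: bool(s.gold_answer.strip())
samples = [Sample("ما عاصمة مصر؟", "القاهرة"), Sample("سؤال معيب", "")]
kept = validate(samples, [not_empty])
print(len(kept))  # only the sample with a valid gold answer survives
```

In a real pipeline each `Reviewer` would wrap an LLM call or a human annotation queue; the unanimous-approval rule is one plausible way to combine them.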

Relying on unfiltered evaluation sets leads to inaccurate model rankings. By removing flawed translations, QIMMA provides a cleaner signal of true linguistic comprehension. This methodology changes how you evaluate and test AI agents operating in Arabic, moving the standard away from simply aggregating legacy benchmarks.

Benchmark Scope and Domains

The evaluation suite contains over 52,000 samples across 109 subsets. More than 99% of the data originates in native Arabic. The domains span STEM, Legal, Medical, and Trust & Safety. Specialized medical evaluation utilizes benchmarks like MedArabiQ and MedAraBench, while trust metrics rely on AraTrust.

QIMMA introduces the first Arabic code evaluation via 3LM, which adapts HumanEval+ and MBPP+. The system runs on transparent infrastructure, utilizing LightEval and EvalPlus.
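The "plus" in HumanEval+ and MBPP+ refers to scoring a completion against extended edge-case tests in addition to the original ones. The following self-contained sketch illustrates that idea only; the task, tests, and completion are invented examples, not 3LM or EvalPlus internals.

```python
# Minimal sketch of EvalPlus-style scoring: a completion passes only if it
# satisfies both the original ("base") tests and the extended ("plus") tests.
# The task and tests below are illustrative, not drawn from 3LM.

def passes(code: str, tests: list[tuple[tuple, object]]) -> bool:
    """Execute a candidate solution and check it against all test cases."""
    namespace: dict = {}
    exec(code, namespace)  # real harnesses run this step in a sandbox
    fn = namespace["add"]
    return all(fn(*args) == expected for args, expected in tests)

base_tests = [((1, 2), 3)]
plus_tests = [((0, 0), 0), ((-1, 1), 0)]  # extra edge cases, "plus" style

completion = "def add(a, b):\n    return a + b\n"
print(passes(completion, base_tests) and passes(completion, plus_tests))  # True
```

The stricter plus-test pass rate is what separates solutions that merely fit the original tests from ones that generalize, which is why QIMMA adopts this harness for code.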

Model Performance and Architecture

Initial rankings demonstrate the advantage of native training over scale. TII’s Falcon-H1-34B model, built with a Hybrid-Head architecture, matches or exceeds the performance of much larger generalist models on Arabic-specific tasks.

Model        Parameters   Performance Context
Falcon-H1    34B          Top performance on native Arabic tasks
Qwen 2.5     72B          Outperformed or matched by Falcon-H1
Llama 3.3    70B          Outperformed or matched by Falcon-H1

To support reproducibility, TII provides per-sample inference outputs for all evaluated models. You can audit exact success and failure modes on individual questions. This level of transparency is critical when you monitor AI applications for production readiness across diverse linguistic subsets.
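Auditing per-sample outputs usually means aggregating per-question correctness into per-subset scores. Here is a minimal sketch of that aggregation; the JSONL field names (`subset`, `correct`) are assumptions for illustration, and QIMMA's published schema may differ.

```python
# Hypothetical audit of per-sample inference outputs: roll individual
# correctness records up into per-subset accuracy. Field names are assumed.
import json
from collections import defaultdict

def subset_accuracy(lines: list[str]) -> dict[str, float]:
    """Aggregate per-sample correctness into accuracy per benchmark subset."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        totals[rec["subset"]] += 1
        hits[rec["subset"]] += int(rec["correct"])
    return {s: hits[s] / totals[s] for s in totals}

records = [
    '{"subset": "Medical", "correct": true}',
    '{"subset": "Medical", "correct": false}',
    '{"subset": "Legal", "correct": true}',
]
print(subset_accuracy(records))  # {'Medical': 0.5, 'Legal': 1.0}
```

Breaking scores out by subset like this is what lets you spot a model that looks strong in aggregate but fails on, say, a specific legal or medical slice.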

When selecting an Arabic LLM for production, test your candidate models against the per-sample outputs provided in the QIMMA dataset. Examining exactly where larger models fail on native Arabic nuances will dictate whether you deploy a localized 34B model or absorb the inference costs of a 70B generalist.
