New QIMMA Leaderboard Ranks Top Arabic AI Models
TII's QIMMA leaderboard introduces a quality-first validation pipeline and native Arabic code evaluation to redefine Large Language Model benchmarking.
The Technology Innovation Institute (TII) has launched QIMMA, an Arabic LLM leaderboard that filters out translation artifacts and encoding errors before ranking models. If you build Arabic-language applications, this shifts how you select foundation models. Previous benchmarks often relied on direct English translations, which introduced cultural misalignments and flawed gold-standard answers. QIMMA addresses this by evaluating models purely on validated, native Arabic datasets.
Validation Before Evaluation
The leaderboard relies on a strict filtering mechanism. Every sample in the dataset passes through two different LLMs and human experts. This validation pipeline identifies incorrect gold answers and cultural biases. When applied to existing, widely used benchmarks, the process discarded a substantial share of samples.
Relying on unfiltered evaluation sets leads to inaccurate model rankings. By removing flawed translations, QIMMA provides a cleaner signal of true linguistic comprehension. This methodology changes how you evaluate and test AI agents operating in Arabic, moving the standard away from simply aggregating legacy benchmarks.
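The two-judge filter described above can be sketched as follows. This is an illustration, not TII's actual pipeline: the `Judge` functions here are crude stand-ins (QIMMA uses two LLMs plus human experts), and the heuristic of flagging Latin script in a gold answer is only a toy proxy for detecting translation artifacts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    question: str
    gold_answer: str

# Hypothetical judge interface: returns True if the sample looks valid.
Judge = Callable[[Sample], bool]

def validate(samples: list[Sample],
             judge_a: Judge,
             judge_b: Judge) -> tuple[list[Sample], list[Sample]]:
    """Keep a sample only if both judges accept it; everything else goes
    to a discard pile (which a real pipeline would route to human review)."""
    kept, discarded = [], []
    for s in samples:
        (kept if judge_a(s) and judge_b(s) else discarded).append(s)
    return kept, discarded

# Toy judges for illustration only.
def judge_script(s: Sample) -> bool:
    # Flag gold answers containing Latin script, a crude proxy
    # for an untranslated artifact in a native-Arabic dataset.
    return bool(s.gold_answer) and not any("a" <= c.lower() <= "z" for c in s.gold_answer)

def judge_nonempty(s: Sample) -> bool:
    return len(s.gold_answer.strip()) > 0

samples = [
    Sample("ما عاصمة مصر؟", "القاهرة"),
    Sample("ما عاصمة فرنسا؟", "Paris"),  # translation artifact slips through
]
kept, discarded = validate(samples, judge_script, judge_nonempty)
```

The key design point is that rejection by either judge is sufficient to discard, which trades recall of valid samples for a cleaner evaluation set.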
Benchmark Scope and Domains
The evaluation suite contains over 52,000 samples across 109 subsets. More than 99% of the data originates in native Arabic. The domains span STEM, Legal, Medical, and Trust & Safety. Specialized medical evaluation utilizes benchmarks like MedArabiQ and MedAraBench, while trust metrics rely on AraTrust.
QIMMA introduces the first Arabic code evaluation via 3LM, which adapts HumanEval+ and MBPP+. The system runs on transparent infrastructure, utilizing LightEval and EvalPlus.
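HumanEval+ and MBPP+ score models on functional correctness: generated code is executed against test cases rather than compared to a reference string. A minimal sketch of that scoring scheme, which 3LM adapts to Arabic, might look like the following. This is an illustration only, not the 3LM or EvalPlus harness, and the function name and test pairs are invented for the example.

```python
def passes(candidate_src: str, entry_point: str,
           tests: list[tuple[tuple, object]]) -> bool:
    """Execute a model-generated completion and check it against
    input/output pairs. Any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # real harnesses sandbox this step
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

# A hypothetical completion with an Arabic docstring, as a native
# Arabic benchmark would present it:
completion = '''
def jam(a, b):
    """أعد مجموع العددين."""
    return a + b
'''
tests = [((1, 2), 3), ((-1, 1), 0)]
score = passes(completion, "jam", tests)
```

The "+" variants of these benchmarks augment the original test suites with many more cases, so a completion that merely pattern-matches the examples fails where a correct one passes.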
Model Performance and Architecture
Initial rankings demonstrate the advantage of native training over scale. TII’s Falcon-H1-34B model, built with a Hybrid-Head architecture, matches or exceeds the performance of much larger generalist models on Arabic-specific tasks.
| Model | Parameters | Performance Context |
|---|---|---|
| Falcon-H1 | 34B | Top performance on native Arabic tasks |
| Qwen 2.5 | 72B | Outperformed or matched by Falcon-H1 |
| Llama 3.3 | 70B | Outperformed or matched by Falcon-H1 |
To support reproducibility, TII provides per-sample inference outputs for all evaluated models. You can audit exact success and failure modes on individual questions. This level of transparency is critical when you monitor AI applications for production readiness across diverse linguistic subsets.
When selecting an Arabic LLM for production, test your candidate models against the per-sample outputs provided in the QIMMA dataset. Examining exactly where larger models fail on native Arabic nuances will dictate whether you deploy a localized 34B model or absorb the inference costs of a 70B generalist.
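An audit of per-sample outputs reduces to a pairwise comparison: for each question, which model got it right? The sketch below assumes a flat record schema (`question_id`, `model`, `correct`); QIMMA's actual release format may differ, and the record values here are invented.

```python
# Hypothetical per-sample records; the real QIMMA dump may use a
# different schema (e.g. JSONL with full generations).
records = [
    {"question_id": "stem-001",  "model": "Falcon-H1-34B", "correct": True},
    {"question_id": "stem-001",  "model": "Llama-3.3-70B", "correct": False},
    {"question_id": "legal-007", "model": "Falcon-H1-34B", "correct": False},
    {"question_id": "legal-007", "model": "Llama-3.3-70B", "correct": True},
]

def wins(records: list[dict], model_a: str, model_b: str) -> list[str]:
    """Question IDs where model_a succeeds and model_b fails."""
    by_q: dict[str, dict[str, bool]] = {}
    for r in records:
        by_q.setdefault(r["question_id"], {})[r["model"]] = r["correct"]
    return sorted(
        q for q, res in by_q.items()
        if res.get(model_a) is True and res.get(model_b) is False
    )

falcon_wins = wins(records, "Falcon-H1-34B", "Llama-3.3-70B")
```

Grouping such wins by subset (STEM, Legal, Medical, Trust & Safety) shows whether a smaller native model's advantage concentrates in the domains your application actually serves.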