
12B JetBrains Mellum2 MoE Fits Agent Routines on One H100

JetBrains released Mellum2, an open-source 12-billion-parameter MoE coding model optimized for sub-agent routing and low-latency inference.

On June 1, 2026, JetBrains released Mellum2, a 12-billion parameter Mixture-of-Experts coding assistant licensed under Apache 2.0. Operating with 2.5 billion active parameters per token, the model targets low-latency software engineering tasks like workload routing and rapid context summarization. By maintaining a small active parameter footprint, Mellum2 allows developers to run capable agent sub-routines entirely on local hardware.

Architecture and Training Data

Mellum2 relies on a 64-expert routing configuration. To limit memory and compute at inference time, the model combines Grouped-Query Attention (four KV heads) with Sliding Window Attention over a 1,024-token window, applied to three out of every four layers.
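To make the routing idea concrete, here is a minimal sketch of generic top-k expert gating for a single token. Only the 64-expert count comes from the article. The value of `TOP_K`, the function name `route_token`, and the choice to apply softmax over the selected logits are assumptions for illustration, not a description of Mellum2's actual router.

```python
import math

NUM_EXPERTS = 64  # from the article
TOP_K = 4         # hypothetical: the article does not say how many experts fire per token


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for one token, highest score first."""
    if len(router_logits) != NUM_EXPERTS:
        raise ValueError(f"expected {NUM_EXPERTS} router logits")
    # Select the indices of the top_k highest-scoring experts.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Softmax over only the chosen logits so the mixing weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Example: four experts stand out, the other sixty are never evaluated.
logits = [0.0] * NUM_EXPERTS
logits[5], logits[10], logits[20], logits[63] = 3.0, 2.0, 1.0, 0.5
selection = route_token(logits)
```

Because only the selected experts run for a given token, the compute per token tracks the active parameter count rather than the full 12 billion.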

The model supports a 131,072-token context window extended via layer-selective YaRN. JetBrains trained the foundation on 10.6 trillion tokens of natural language and code. During the three-phase training curriculum, the ratio of code in the dataset scaled from 23% to 42% and finally to 59%, forcing the model to specialize in programmatic structure during late-stage pre-training.
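A rough back-of-the-envelope sketch shows why the sliding window matters at full context length. The context size, window size, and three-in-four layer ratio come from the article. The layer count (`NUM_LAYERS`), the position of the global layer within each group of four, and the assumption that windowed layers discard keys and values outside the window are all hypothetical.

```python
CONTEXT_TOKENS = 131_072  # from the article
WINDOW_TOKENS = 1_024     # from the article
NUM_LAYERS = 32           # hypothetical: the article does not state the layer count


def layer_uses_window(layer_index):
    # Assumption: in each group of four layers, the first three use the
    # sliding window and the fourth attends over the full context.
    return layer_index % 4 != 3


def cached_positions(context_tokens, num_layers=NUM_LAYERS, window=WINDOW_TOKENS):
    """Total token positions held in the KV cache, summed over all layers."""
    total = 0
    for layer in range(num_layers):
        if layer_uses_window(layer):
            total += min(context_tokens, window)  # only the most recent window
        else:
            total += context_tokens               # every position so far
    return total


full = CONTEXT_TOKENS * NUM_LAYERS        # every layer caching the whole context
mixed = cached_positions(CONTEXT_TOKENS)  # three of four layers windowed
ratio = mixed / full
print(f"cached positions: {mixed:,} of {full:,} ({ratio:.1%})")
```

Under these assumptions the cache at full context holds roughly a quarter of the positions that all-global attention would need. The sketch counts positions, not bytes. Grouped-Query Attention shrinks the per-position cost further by storing only four KV heads.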

Released Variants and Benchmarks

JetBrains released three variants: Base, Instruct, and a reasoning-augmented Thinking model. The Thinking model was trained with reinforcement learning with verifiable rewards and emits its reasoning trace inside explicit XML blocks before answering.
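If you consume the Thinking variant programmatically, you will usually want to separate the reasoning trace from the final answer. The sketch below shows one way to do that. The tag name `think` is an assumption, since the article mentions XML blocks without naming the tag, and `split_reasoning` is a hypothetical helper, not part of any JetBrains tooling.

```python
import re

# Hypothetical tag name: the article says "explicit XML blocks" but does not name the tag.
REASONING_TAG = "think"


def split_reasoning(output, tag=REASONING_TAG):
    """Return (reasoning, answer); reasoning is None when no block is found."""
    pattern = re.compile(
        rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>",
        re.DOTALL,  # reasoning traces normally span multiple lines
    )
    match = pattern.search(output)
    if match is None:
        return None, output.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first reasoning block is treated as the answer.
    answer = (output[:match.start()] + output[match.end():]).strip()
    return reasoning, answer


sample = "<think>\nCheck edge cases.\n</think>\nUse a set."
reasoning, answer = split_reasoning(sample)
```

Keeping the trace separate lets you log it for debugging while passing only the answer to downstream tools.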

The architecture achieves throughput that matches or exceeds Qwen2.5-7B on a single NVIDIA H100 GPU. The Thinking variant scored 69.9% on LiveCodeBench v6 and a 58.4% mean across AIME 2025 and 2026. For general capabilities, JetBrains provided direct comparisons between the Instruct and Thinking models:

| Benchmark | Instruct Variant | Thinking Variant |
| --- | --- | --- |
| MMLU-Redux | 86.2% | 88.3% |
| GPQA Diamond | 57.6% | 76.8% |

Workload Routing and Sub-Agents

Mellum2 is meant to serve as a routing layer rather than a frontier model. If you build multi-agent systems, you can deploy Mellum2 as an intent analyzer that directs complex prompts to larger models or specialized tools. Its 2.5B active parameters enable fast execution for sub-agent tasks like validation and planning without incurring the latency of API calls to cloud providers.
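The intent-analyzer pattern described above can be sketched as a small dispatcher. Everything here is illustrative: `make_router`, the label names, and the handlers are hypothetical, and `keyword_classifier` is a stand-in for a call to a locally hosted model that returns a single intent label.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_router(
    classify: Callable[[str], str],
    handlers: Dict[str, Handler],
    fallback: str,
) -> Handler:
    """Build a function that classifies a prompt and dispatches it to a handler."""

    def route(prompt: str) -> str:
        label = classify(prompt).strip().lower()
        # Unknown labels go to the fallback handler instead of failing.
        handler = handlers.get(label, handlers[fallback])
        return handler(prompt)

    return route


def keyword_classifier(prompt: str) -> str:
    # Stand-in for the small local model: swap in a real inference call here.
    text = prompt.lower()
    if "summarize" in text:
        return "summarize"
    if "refactor" in text or "design" in text:
        return "escalate"
    return "validate"


handlers: Dict[str, Handler] = {
    "summarize": lambda p: "local:summarize",
    "validate": lambda p: "local:validate",
    "escalate": lambda p: "remote:large-model",
}

router = make_router(keyword_classifier, handlers, fallback="escalate")
print(router("Summarize this diff"))
print(router("Design a new caching layer"))
```

Routing unknown labels to the larger model is a conservative default: a misclassification costs extra latency rather than a wrong answer from an underpowered path.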

The compact footprint supports air-gapped environments and localized data handling. Developers can run the model on individual machines or on-premise infrastructure for secure code retrieval and local summarization.

If your current architecture relies on large dense models for basic routing and formatting, replacing those specific steps with Mellum2-12B-A2.5B-Instruct should lower your task latency. Evaluate the Thinking variant for local validation loops where explicit reasoning traces can improve the reliability of automated code edits.
