12B JetBrains Mellum2 MoE Fits Agent Routines on One H100

On June 1, 2026, JetBrains released Mellum2, a 12-billion-parameter Mixture-of-Experts coding model licensed under Apache 2.0. With 2.5 billion parameters active per token, the model targets low-latency software engineering tasks such as workload routing and rapid context summarization. Because only a small share of its parameters is active at any step, Mellum2 lets developers run capable agent sub-routines entirely on local hardware.

Architecture and Training Data

Mellum2 relies on a 64-expert routing configuration. To manage memory and compute during inference, the model combines Grouped-Query Attention using four KV heads with 1,024-token Sliding Window Attention applied to three out of every four layers.
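The effect of that layer mix on KV-cache size can be estimated with a few lines of arithmetic. The sketch below is an illustration, not JetBrains' implementation: the layer count of 32 and the choice of which layer in each group of four uses full attention are assumptions, since the article states neither. The resulting ratio does not depend on the number of KV heads or on the head dimension, because both scale the windowed and full-attention caches equally.

```python
def kv_cache_tokens(seq_len: int, num_layers: int,
                    window: int = 1024, period: int = 4) -> int:
    """Count cached token positions summed over all layers.

    Assumption: in every group of `period` layers, the last one uses
    full attention and the others use a sliding window. The article
    only says three of every four layers are windowed.
    """
    total = 0
    for layer in range(num_layers):
        is_full_attention = (layer % period) == period - 1
        # A windowed layer never needs more than `window` cached positions.
        total += seq_len if is_full_attention else min(seq_len, window)
    return total


def cache_ratio(seq_len: int, num_layers: int = 32) -> float:
    """Cache size relative to a model with full attention on every layer.

    num_layers=32 is a hypothetical value chosen for illustration.
    """
    return kv_cache_tokens(seq_len, num_layers) / (seq_len * num_layers)
```

Under these assumptions, `cache_ratio(131072)` comes to roughly 0.26, so at the full context length the cache is about a quarter of what full attention on every layer would need. For prompts of 1,024 tokens or fewer the ratio is 1.0, because the window never truncates anything.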

The model supports a 131,072-token context window extended via layer-selective YaRN. JetBrains trained the foundation on 10.6 trillion tokens of natural language and code. During the three-phase training curriculum, the ratio of code in the dataset scaled from 23% to 42% and finally to 59%, forcing the model to specialize in programmatic structure during late-stage pre-training.

Released Variants and Benchmarks

JetBrains released three model variants: Base, Instruct, and a reasoning-augmented Thinking variant. The Thinking model was trained using reinforcement learning with verifiable rewards and emits its reasoning traces inside explicit XML blocks before answering.
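Because the Thinking variant wraps its reasoning in XML blocks, downstream code usually needs to separate the trace from the final answer. Here is a minimal sketch of that step. The tag name `think` is an assumption used as a placeholder, since the article does not name the tag, and `split_reasoning` and `ParsedReply` are hypothetical helpers rather than part of any JetBrains tooling.

```python
import re
from typing import NamedTuple

# Assumption: "think" is a placeholder tag name. The article only says
# the traces appear inside "explicit XML blocks".
TRACE_TAG = "think"


class ParsedReply(NamedTuple):
    reasoning: str  # concatenated contents of all trace blocks
    answer: str     # remaining text with the trace blocks removed


def split_reasoning(text: str, tag: str = TRACE_TAG) -> ParsedReply:
    """Separate reasoning traces from the final answer in a model reply."""
    safe_tag = re.escape(tag)
    pattern = re.compile(rf"<{safe_tag}>(.*?)</{safe_tag}>", re.DOTALL)
    traces = [match.strip() for match in pattern.findall(text)]
    answer = pattern.sub("", text).strip()
    return ParsedReply(reasoning="\n".join(traces), answer=answer)
```

A reply with no trace block is returned unchanged as the answer with an empty reasoning string, so the same helper can sit in front of both the Instruct and Thinking variants.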

On a single NVIDIA H100 GPU, the architecture achieves throughput that matches or exceeds that of Qwen2.5-7B. The Thinking variant scored 69.9% on LiveCodeBench v6 and 58.4% on the AIME 2025/2026 mean. For general capabilities, JetBrains provided direct comparisons between the Instruct and Thinking variants:

Benchmark	Instruct Variant	Thinking Variant
MMLU-Redux	86.2%	88.3%
GPQA Diamond	57.6%	76.8%

Workload Routing and Sub-Agents

Mellum2 functions as a focused routing layer rather than a frontier model. If you build multi-agent systems, you can deploy Mellum2 as an intent analyzer that directs complex prompts to larger models or specialized tools. Its 2.5B active parameters enable fast execution of sub-agent tasks such as validation and planning, without the latency of API calls to cloud providers.
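The intent-analyzer pattern described above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not a JetBrains API: `make_router`, the label names, and `keyword_classifier` are all hypothetical, and the keyword classifier stands in for a real call to a locally served Mellum2 instance, which is omitted so the example runs on its own.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]  # prompt -> intent label
Handler = Callable[[str], str]     # prompt -> response


def make_router(classify: Classifier,
                handlers: Dict[str, Handler],
                fallback: str) -> Handler:
    """Build a function that sends each prompt to the handler for its intent.

    `classify` is where a small local model would be called. Labels that
    have no registered handler go to the `fallback` handler.
    """
    if fallback not in handlers:
        raise ValueError(f"fallback label {fallback!r} has no handler")

    def route(prompt: str) -> str:
        label = classify(prompt).strip().lower()
        handler = handlers.get(label, handlers[fallback])
        return handler(prompt)

    return route


def keyword_classifier(prompt: str) -> str:
    """Hypothetical stand-in for a model call that returns an intent label."""
    lowered = prompt.lower()
    if "summarize" in lowered:
        return "summarize"
    if "refactor" in lowered or "design" in lowered:
        return "escalate"
    return "unknown"
```

In use, `handlers` would map `"summarize"` to a local routine and `"escalate"` to a client for a larger model. Sending unrecognized labels to the escalation path is a conservative choice: a misclassified prompt costs extra latency rather than producing a weak answer.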

The compact footprint supports air-gapped environments and localized data handling. Developers can run the model on individual machines or on-premise infrastructure for secure code retrieval and local summarization.

If your current architecture relies on large dense models for basic routing and formatting, replacing those specific steps with Mellum2-12B-A2.5B-Instruct should lower your task latency, since far fewer parameters are active per token. Evaluate the Thinking variant for local validation loops where explicit reasoning traces can improve the reliability of automated code edits.
