12B JetBrains Mellum2 MoE Fits Agent Routines on One H100
JetBrains released Mellum2, an open-source 12-billion parameter MoE coding model optimized for sub-agent routing and low-latency inference.
On June 1, 2026, JetBrains released Mellum2, a 12-billion parameter Mixture-of-Experts coding assistant licensed under Apache 2.0. Operating with 2.5 billion active parameters per token, the model targets low-latency software engineering tasks like workload routing and rapid context summarization. By maintaining a small active parameter footprint, Mellum2 allows developers to run capable agent sub-routines entirely on local hardware.
Architecture and Training Data
Mellum2 uses a 64-expert routing configuration. To limit memory and compute at inference time, the model combines Grouped-Query Attention (four KV heads) with a 1,024-token Sliding Window Attention applied to three out of every four layers.
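To see why the mixed attention pattern matters for memory, here is a minimal sketch that counts how many token positions each layer has to keep in its KV cache. Only the 1,024-token window, the three-in-four ratio, the four KV heads, and the 131,072-token context come from this article. The layer count, head dimension, 2-byte element size, and the position of the full-attention layer within each group of four are assumptions chosen for illustration, not published figures.

```python
def uses_sliding_window(layer_idx: int, period: int = 4) -> bool:
    """Assumed pattern: the last layer in each group of `period` is full attention."""
    return (layer_idx + 1) % period != 0


def cached_positions(context_len: int, num_layers: int,
                     window: int = 1024, period: int = 4) -> int:
    """Total token positions held in the KV cache, summed over all layers."""
    total = 0
    for layer_idx in range(num_layers):
        if uses_sliding_window(layer_idx, period):
            # A sliding-window layer never keeps more than `window` positions.
            total += min(context_len, window)
        else:
            # A full-attention layer keeps every position in the context.
            total += context_len
    return total


def kv_cache_bytes(positions: int, kv_heads: int = 4,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes for keys plus values; head_dim and bytes_per_value are assumptions."""
    return positions * 2 * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    context_len = 131_072
    num_layers = 32  # hypothetical layer count, not from the article
    mixed = cached_positions(context_len, num_layers)
    full = context_len * num_layers
    print(f"mixed pattern: {kv_cache_bytes(mixed) / 2**30:.2f} GiB")
    print(f"all full attention: {kv_cache_bytes(full) / 2**30:.2f} GiB")
```

Under these assumptions, each sliding-window layer holds at most 1,024 positions no matter how long the prompt is, so at long contexts the cache is dominated by the one-in-four full-attention layers.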
The model supports a 131,072-token context window extended via layer-selective YaRN. JetBrains trained the foundation on 10.6 trillion tokens of natural language and code. During the three-phase training curriculum, the ratio of code in the dataset scaled from 23% to 42% and finally to 59%, forcing the model to specialize in programmatic structure during late-stage pre-training.
Released Variants and Benchmarks
JetBrains released three variants: Base, Instruct, and a reasoning-augmented Thinking variant. The Thinking model uses reinforcement learning with verifiable rewards to emit reasoning traces inside explicit XML blocks before answering.
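If you consume Thinking output programmatically, you will usually want to separate the reasoning trace from the final answer. The sketch below shows one way to do that with a regular expression. The tag name `think` is an assumption made for illustration; this article does not name the tag, so check the model card before relying on it.

```python
import re
from typing import List, Tuple


def split_reasoning(text: str, tag: str = "think") -> Tuple[List[str], str]:
    """Return (reasoning traces, final answer) for a model response.

    `tag` is a hypothetical default, not a confirmed part of the model's format.
    """
    escaped = re.escape(tag)
    # Non-greedy match so several blocks are captured separately;
    # DOTALL lets a trace span multiple lines.
    pattern = re.compile(rf"<{escaped}>(.*?)</{escaped}>", re.DOTALL)
    traces = [trace.strip() for trace in pattern.findall(text)]
    answer = pattern.sub("", text).strip()
    return traces, answer


if __name__ == "__main__":
    response = "<think>The loop is off by one.</think>\nUse range(len(items))."
    traces, answer = split_reasoning(response)
    print(traces)
    print(answer)
```

Keeping the trace separate lets you log it for debugging while passing only the answer to downstream tools.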
The architecture achieves throughput that matches or exceeds Qwen2.5-7B on a single NVIDIA H100 GPU. The Thinking variant scored 69.9% on LiveCodeBench v6 and a mean of 58.4% across AIME 2025 and 2026. For general capabilities, JetBrains provided direct comparisons between the Instruct and Thinking models:
| Benchmark | Instruct Variant | Thinking Variant |
|---|---|---|
| MMLU-Redux | 86.2% | 88.3% |
| GPQA Diamond | 57.6% | 76.8% |
Workload Routing and Sub-Agents
Mellum2 is best used as a routing layer rather than as a frontier model. If you build multi-agent systems, you can deploy Mellum2 as an intent analyzer that directs complex prompts to larger models or specialized tools. Its 2.5B active parameters enable fast execution of sub-agent tasks like validation and planning without the latency of API calls to cloud providers.
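The routing pattern described above can be sketched as a small dispatcher: a classifier labels the prompt, and the label selects a handler. Everything named here is hypothetical. `stub_classify` is a keyword stand-in for a call to a locally served model, and the labels and handlers are invented for illustration rather than taken from JetBrains.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_router(classify: Callable[[str], str],
                handlers: Dict[str, Handler],
                fallback: str) -> Handler:
    """Build a function that sends each prompt to the handler chosen by `classify`."""
    if fallback not in handlers:
        raise ValueError(f"fallback label {fallback!r} has no handler")

    def route(prompt: str) -> str:
        label = classify(prompt).strip().lower()
        # Unknown labels go to the fallback instead of failing the request.
        handler = handlers.get(label, handlers[fallback])
        return handler(prompt)

    return route


def stub_classify(prompt: str) -> str:
    """Keyword stand-in for the small model's intent label (assumption)."""
    lowered = prompt.lower()
    if "summarize" in lowered:
        return "summarize"
    if "refactor" in lowered or "design" in lowered:
        return "escalate"
    return "local"


if __name__ == "__main__":
    router = make_router(
        stub_classify,
        {
            "summarize": lambda p: f"[local summary] {p}",
            "escalate": lambda p: f"[sent to larger model] {p}",
            "local": lambda p: f"[handled locally] {p}",
        },
        fallback="local",
    )
    print(router("Summarize this diff"))
    print(router("Design a new caching layer"))
```

In a real deployment, `stub_classify` would be replaced by a request to the local model with a prompt that constrains it to return one of the known labels.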
The compact footprint supports air-gapped environments and localized data handling. Developers can run the model on individual machines or on-premise infrastructure for secure code retrieval and local summarization.
If your current architecture relies on large dense models for basic routing and formatting, replacing those specific steps with Mellum2-12B-A2.5B-Instruct should lower your task latency. Evaluate the Thinking variant for local validation loops where explicit reasoning traces can improve the reliability of automated code edits.