GLM-5.1 MoE Beats GPT-5.4 in Open-Source Engineering Milestone
Zhipu AI releases GLM-5.1 under MIT license, a 744B parameter MoE model that outperforms GPT-5.4 on the SWE-Bench Pro software engineering benchmark.
Zhipu AI, now operating as Z.ai following a January 2026 IPO, released GLM-5.1 under the permissive MIT License. The 744-billion-parameter Mixture-of-Experts (MoE) model is optimized specifically for long-horizon autonomous engineering tasks. The release accelerates the expected timeline for open-weight models, which now match, and on one key software engineering benchmark exceed, proprietary frontier models.
Architecture and Infrastructure
GLM-5.1 builds upon the GLM-5 base model with a specialized MoE architecture. The network utilizes Multi-head Latent Attention (MLA) and Dynamic Sparse Attention (DSA) to manage context retrieval over long inference sessions. The model supports a 200,000-token context window, paired with a strict 128,000-token output limit.
| Specification | Value |
|---|---|
| Total Parameters | 744 billion |
| Active Parameters (Per Forward Pass) | 40 billion |
| Total Experts | 256 |
| Active Experts (Per Token) | 8 |
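The efficiency of these numbers comes from sparse routing: each token activates only 8 of the 256 experts, so roughly 40B of the 744B parameters run per forward pass. The sketch below illustrates top-k expert routing with the figures from the table. The hidden size, gating network, and initialization are toy assumptions for illustration, not GLM-5.1's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL = 64       # toy hidden size (real models use thousands of dims)
N_EXPERTS = 256    # total experts, per the spec table
TOP_K = 8          # experts activated per token, per the spec table

# Hypothetical gating network and expert weight matrices.
gate_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
experts = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token vector through its top-k experts."""
    logits = x @ gate_w                   # one gate score per expert
    top = np.argsort(logits)[-TOP_K:]     # indices of the 8 best-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # softmax renormalized over the top-k
    # Only 8 of 256 expert matmuls execute: ~3% of expert parameters per token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_layer(token)
print(out.shape)  # (64,)
```

The ratio in the table (8 active of 256 total) is what lets a 744B-parameter model run with the per-token compute of a ~40B dense model.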
The pretraining infrastructure marks a complete departure from Nvidia hardware. Z.ai trained the model entirely on 100,000 Huawei Ascend 910B chips, proving the viability of large-scale domestic hardware clusters for frontier model training.
Engineering Benchmark Results
The model’s primary optimization target is “productive horizons,” referring to the sustained time-on-task capabilities required for autonomous software development. Z.ai tuned GLM-5.1 to maintain a continuous “plan-execute-test-fix” loop. The model can operate autonomously for up to 8 hours and execute approximately 1,700 steps without human intervention.
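The loop described above can be sketched as a simple control structure. Every function name here (`plan_step`, `execute`, `run_tests`, `apply_fix`) is a hypothetical stand-in, not Z.ai's agent framework; only the step budget comes from the article.

```python
from dataclasses import dataclass, field

MAX_STEPS = 1700  # ~8 hours of autonomous operation, per the article

@dataclass
class AgentState:
    step: int = 0
    history: list = field(default_factory=list)

def run_agent(task, plan_step, execute, run_tests, apply_fix):
    """Plan-execute-test-fix loop: iterate until tests pass or budget runs out."""
    state = AgentState()
    while state.step < MAX_STEPS:
        state.step += 1
        action = plan_step(task, state.history)  # plan the next edit
        result = execute(action)                 # execute it
        state.history.append((action, result))
        failures = run_tests()                   # test
        if not failures:
            return state                         # task complete
        apply_fix(failures)                      # fix, then loop back to plan
    return state

# Toy usage: stubbed tests pass after three fix iterations.
remaining = [3]
def run_tests(): return ["fail"] * remaining[0]
def apply_fix(failures): remaining[0] -= 1
state = run_agent("demo", lambda t, h: "edit", lambda a: "ok", run_tests, apply_fix)
print(state.step)  # 4
```

The key design point is the hard step budget: without it, a loop tuned for eight-hour horizons can burn tokens indefinitely on an unfixable failure.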
If you evaluate and test AI agents for production workflows, the performance data establishes a new baseline for open weights. GLM-5.1 currently leads the SWE-Bench Pro leaderboard for resolving real-world GitHub issues.
| Model | SWE-Bench Pro Score |
|---|---|
| GLM-5.1 | 58.4 |
| GPT-5.4 | 57.7 |
| Claude Opus 4.6 | 57.3 |
Performance on Terminal-Bench 2.0 showed significant improvement over the previous generation. GLM-5.1 scored 69.0, a marked jump from GLM-5’s 56.2, though it remains behind GPT-5.4’s score of 75.1. The model also achieved a 68.7 on the CyberGym benchmark, tested across 1,507 real-world security tasks.
These strengths are domain-specific: GLM-5.1 still trails models from Google and OpenAI in general-purpose reasoning and on standard knowledge benchmarks such as GPQA Diamond.
Deployment and Pricing
The permissive MIT License allows teams to run the model locally using the open weights hosted on Hugging Face. Self-hosting eliminates recurring inference costs for continuous agent workflows.
Cloud-based API usage reflects a new pricing strategy. Z.ai increased its API pricing by 8 to 17 percent to align with Western competitors. Premium-tier pricing now approaches Claude Sonnet 4.6 levels at $25 per million input tokens. Because agentic loops consume high token volumes during eight-hour planning and testing phases, using the managed API requires strict budget controls.
If you deploy autonomous coding agents, you now have a self-hostable alternative to GPT-5.4. Calculate the total token consumption of your iterative workflows to determine whether the hardware investment required to host GLM-5.1 yields a better return than standard API usage.
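A back-of-envelope version of that calculation is sketched below. Only the $25 per million input tokens comes from the article; the output price, tokens per step, and hardware cost are placeholder assumptions you should replace with your own numbers.

```python
INPUT_PRICE = 25.0 / 1_000_000    # $/input token (premium tier, per the article)
OUTPUT_PRICE = 75.0 / 1_000_000   # $/output token -- assumed, not published

def run_cost(steps=1700, in_tok_per_step=6_000, out_tok_per_step=800):
    """Estimated API cost of one full agent run (assumed token volumes)."""
    return steps * (in_tok_per_step * INPUT_PRICE + out_tok_per_step * OUTPUT_PRICE)

def breakeven_runs(hardware_cost=250_000.0):
    """Agent runs needed before an assumed self-hosting outlay pays for itself."""
    return hardware_cost / run_cost()

print(f"${run_cost():,.0f} per run")  # $357 per run under these assumptions
```

Even rough numbers like these show why long-horizon agents change the deployment math: at hundreds of dollars per run, a few daily workflows can justify dedicated hardware within a year.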