
GLM-5.1 MoE Beats GPT-5.4 in Open-Source Engineering Milestone

Zhipu AI releases GLM-5.1 under MIT license, a 744B parameter MoE model that outperforms GPT-5.4 on the SWE-Bench Pro software engineering benchmark.

Zhipu AI, now operating as Z.ai following a January 2026 IPO, released GLM-5.1 under the permissive MIT License. The 744-billion-parameter Mixture-of-Experts (MoE) model is optimized specifically for long-horizon autonomous engineering tasks. The release accelerates the expected timeline for open-weight models, which now match and in some areas exceed proprietary frontier models on key software engineering benchmarks.

Architecture and Infrastructure

GLM-5.1 builds upon the GLM-5 base model with a specialized MoE architecture. The network utilizes Multi-head Latent Attention (MLA) and Dynamic Sparse Attention (DSA) to manage context retrieval over long inference sessions. The model supports a 200,000-token context window, paired with a strict 128,000-token output limit.

| Specification | Value |
| --- | --- |
| Total Parameters | 744 billion |
| Active Parameters (Per Forward Pass) | 40 billion |
| Total Experts | 256 |
| Active Experts (Per Token) | 8 |
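The sparsity of this design can be sanity-checked directly from the spec table: only a small fraction of the network is active for any given token. A quick calculation, using only the figures reported above:

```python
# Sanity-check the MoE sparsity figures from the spec table above.
total_params = 744e9    # total parameters
active_params = 40e9    # parameters used per forward pass
total_experts = 256
active_experts = 8      # experts routed per token

param_fraction = active_params / total_params
expert_fraction = active_experts / total_experts

print(f"Active parameter fraction: {param_fraction:.1%}")  # 5.4%
print(f"Active expert fraction: {expert_fraction:.1%}")    # 3.1%
```

In other words, each token touches roughly one-twentieth of the total weights, which is what keeps per-token inference cost closer to a 40B dense model than a 744B one.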

The pretraining infrastructure marks a complete departure from Nvidia hardware. Z.ai trained the model entirely on 100,000 Huawei Ascend 910B chips, proving the viability of large-scale domestic hardware clusters for frontier model training.

Engineering Benchmark Results

The model’s primary optimization target is “productive horizons,” referring to the sustained time-on-task capabilities required for autonomous software development. Z.ai tuned GLM-5.1 to maintain a continuous “plan-execute-test-fix” loop. The model can operate autonomously for up to 8 hours and execute approximately 1,700 steps without human intervention.
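The plan-execute-test-fix loop described above can be sketched as a simple control structure. This is a hypothetical illustration, not Z.ai's agent stack: the `plan`, `execute`, `run_tests`, and `fix` helpers are stand-ins you would replace with real model calls and a real test runner.

```python
# Minimal sketch of a plan-execute-test-fix agent loop.
# All helpers are hypothetical stand-ins, not Z.ai's implementation.

MAX_STEPS = 1700  # step budget, matching the ~1,700-step figure cited above

def run_agent(task, plan, execute, run_tests, fix):
    """Loop until the test suite passes or the step budget is exhausted."""
    state = plan(task)                  # draft an initial plan
    for step in range(MAX_STEPS):
        state = execute(state)          # apply the next change
        failures = run_tests(state)     # run the project's test suite
        if not failures:
            return state, step + 1      # done: tests are green
        state = fix(state, failures)    # revise the work from the failures
    return state, MAX_STEPS             # budget exhausted without success

# Toy demo: each "fix" removes one failing test until the suite passes.
final_state, steps = run_agent(
    task="demo",
    plan=lambda t: {"bugs": 3},
    execute=lambda s: s,
    run_tests=lambda s: ["fail"] * s["bugs"],
    fix=lambda s, f: {"bugs": s["bugs"] - 1},
)
print(steps)  # 4
```

The step budget is what makes "productive horizons" measurable: an agent that keeps converging inside the budget can be left unattended, while one that stalls burns its remaining steps without progress.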

If you evaluate and test AI agents for production workflows, the performance data establishes a new baseline for open weights. GLM-5.1 currently leads the SWE-Bench Pro leaderboard for resolving real-world GitHub issues.

| Model | SWE-Bench Pro Score |
| --- | --- |
| GLM-5.1 | 58.4 |
| GPT-5.4 | 57.7 |
| Claude Opus 4.6 | 57.3 |

Performance on Terminal-Bench 2.0 showed significant improvement over the previous generation. GLM-5.1 scored 69.0, a marked jump from GLM-5’s 56.2, though it remains behind GPT-5.4’s score of 75.1. The model also achieved a 68.7 on the CyberGym benchmark, tested across 1,507 real-world security tasks.

These high scores are specialized. GLM-5.1 still trails models from Google and OpenAI in general-purpose reasoning and standard knowledge benchmarks like GPQA Diamond.

Deployment and Pricing

The permissive MIT License allows teams to run the model locally using the open weights hosted on Hugging Face. Self-hosting eliminates recurring inference costs for continuous agent workflows.

Cloud-based API usage reflects a new pricing strategy: Z.ai raised its API prices by 8 to 17 percent to align with Western competitors. Premium-tier pricing now approaches Claude Sonnet 4.6 levels at $25 per million input tokens. Because agentic loops consume large token volumes during eight-hour planning and testing phases, using the managed API requires strict budget controls.

If you deploy autonomous coding agents, you now have a self-hosted alternative to GPT-5.4. Calculate the total token consumption of your iterative workflows to determine whether the hardware investment required to host GLM-5.1 yields a better return than standard API usage.
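That break-even calculation is straightforward to sketch. The $25-per-million input-token figure comes from the article; every other number below (tokens per run, runs per month, amortized hardware cost) is an assumption you should replace with your own measurements.

```python
# Rough break-even sketch: managed API vs self-hosting GLM-5.1.
# Only the $25/M input-token price is from the article; the other
# figures are placeholder assumptions for illustration.

API_INPUT_PRICE = 25.0 / 1_000_000   # USD per input token (premium tier)
TOKENS_PER_RUN = 5_000_000           # assumed tokens for one 8-hour agent run
RUNS_PER_MONTH = 100                 # assumed workload
HARDWARE_MONTHLY = 20_000.0          # assumed amortized server cost per month

api_monthly = API_INPUT_PRICE * TOKENS_PER_RUN * RUNS_PER_MONTH
print(f"API cost/month:  ${api_monthly:,.0f}")       # $12,500
print(f"Self-host/month: ${HARDWARE_MONTHLY:,.0f}")  # $20,000
print("Self-hosting wins" if HARDWARE_MONTHLY < api_monthly else "API wins")
```

Note that at these placeholder numbers the API still wins; the balance flips only once monthly token consumption grows past the amortized hardware cost, which is exactly why measuring your own workloads matters before committing to either path.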
