
What Is Mixture-of-Experts (MoE) in AI?

MoE models can have a trillion parameters yet activate only a fraction of them per token. How expert routing works, why it matters for cost, and which major models use it.

The models behind DeepSeek V3, Mixtral, Kimi K2.5, and Google’s Gemini series (including Gemini 3.1 Pro) all share one architectural decision: Mixture-of-Experts (MoE). It is the reason these models can have hundreds of billions (or even a trillion) parameters while costing less to run than much smaller dense models.

If you work with LLMs and care about performance, cost, or model selection, MoE is worth understanding. It changes what “model size” even means.

Dense Models vs. Sparse Models

In a dense model (like the original GPT-3, Llama 2, or Claude), every parameter participates in processing every token. A 70-billion-parameter dense model runs all 70 billion parameters for every word in your prompt and every word in the response. More parameters means more knowledge and capability, but also more compute per token.

This creates a tradeoff: bigger models are smarter but slower and more expensive. Smaller models are cheaper but less capable.

Sparse models break this tradeoff. A sparse model has a large total number of parameters, but only activates a subset of them for each token. The rest sit idle. The model has the knowledge capacity of a very large model, but the inference cost of a much smaller one.

MoE is the dominant way to build sparse models today.

How MoE Works

A standard transformer block has two main components: an attention layer and a feed-forward network (FFN). In an MoE model, the attention layer stays the same. The change is in the FFN.

Instead of one FFN, an MoE layer has many smaller FFNs called experts. Each expert is a complete feed-forward network with its own weights. The model also has a small neural network called a router (sometimes called a gating network) that decides which experts handle each token.

The process for every token:

  1. The token arrives at the MoE layer after passing through attention.
  2. The router looks at the token’s representation and produces a score for every expert.
  3. The top-k experts (usually 1 or 2) are selected based on the highest scores.
  4. Only those selected experts process the token. The others do nothing.
  5. The outputs from the selected experts are combined using the router’s scores as weights.

The result: a layer with, say, 64 experts where only 2 activate per token. The layer stores 64 experts worth of learned knowledge, but each token only costs 2 experts worth of compute.
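The steps above can be sketched in a few lines of NumPy. This is a toy illustration, not any particular model's implementation: the expert count, dimensions, and the use of plain matrices as "FFNs" are all made up for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token, experts, router_weights, k=2):
    """Route one token through the top-k of many expert FFNs.

    token:          hidden state vector, shape (d,)
    experts:        list of callables, each mapping (d,) -> (d,)
    router_weights: matrix of shape (num_experts, d), one score row per expert
    """
    # 2. The router scores every expert for this token.
    scores = router_weights @ token                 # shape (num_experts,)
    # 3. Pick the k highest-scoring experts.
    top_k = np.argsort(scores)[-k:]
    # 5. Normalize the selected scores into mixing weights.
    weights = softmax(scores[top_k])
    # 4-5. Only the selected experts run; their outputs are blended.
    return sum(w * experts[i](token) for w, i in zip(weights, top_k))

# Toy demo: 4 experts, each a tiny linear "FFN" on an 8-dim token.
rng = np.random.default_rng(0)
d, num_experts = 8, 4
expert_mats = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda x, W=W: W @ x for W in expert_mats]
router = rng.normal(size=(num_experts, d))

out = moe_layer(rng.normal(size=d), experts, router, k=2)
print(out.shape)  # (8,)
```

Note that the compute cost scales with `k`, not with `num_experts`: the unselected experts contribute nothing to the forward pass.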

Real Numbers

This is easier to understand with concrete models:

| Model | Total Params | Active Params | Experts | Active per Token |
|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 |
| DeepSeek V3 | 671B | 37B | 257 (256 routed + 1 shared) | 9 (8 routed + 1 shared) |
| Kimi K2.5 | 1T | 32B | 385 (384 routed + 1 shared) | 9 (8 routed + 1 shared) |
| Qwen 2.5 Max | Not disclosed | Not disclosed | MoE | MoE |

Mixtral 8x7B has 46.7 billion total parameters but runs like a 13B model. DeepSeek V3 has 671 billion parameters but activates 37 billion per token. Kimi K2.5 has a full trillion parameters but only uses 32 billion at a time.

The “size” that determines inference cost and speed is the active parameter count, not the total. This is why MoE models can be enormous without being proportionally expensive to run.
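A quick back-of-envelope using the figures from the table above makes the gap concrete:

```python
# Per-token compute tracks active parameters, not total parameters.
# (total, active) pairs taken from the table above.
models = {
    "Mixtral 8x7B": (46.7e9, 12.9e9),
    "DeepSeek V3":  (671e9, 37e9),
    "Kimi K2.5":    (1e12, 32e9),
}
for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

Kimi K2.5 touches only about 3% of its weights for any given token; the other 97% contribute stored capacity, not per-token cost.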

The Router

The router is what makes MoE work, and also what makes it tricky to train. It is a learned component: a small linear layer that takes a token’s hidden state as input and outputs a score for each expert. During training, the router learns which types of tokens should go to which experts.

Different models use different routing strategies:

Top-2 routing selects the two highest-scoring experts for each token. Mixtral uses this approach. Each selected expert produces an output, and the outputs are combined using the router’s softmax-normalized scores as weights. This provides a blend of two expert perspectives for every token.

Top-1 routing sends each token to exactly one expert. The Switch Transformer (Fedus, Zoph, and Shazeer, 2022) showed that this counterintuitive simplification actually works. Routing to one expert instead of two cuts communication costs and simplifies gradient computation, while still improving model quality. The tradeoff is less blending between expert specializations, but the efficiency gains often outweigh this.

Shared experts are a more recent innovation, used in DeepSeek V3 and Kimi K2.5. One expert is always active regardless of routing. This shared expert learns patterns that are useful across all inputs (common language structures, general reasoning patterns), while the routed experts specialize in narrower areas. The shared expert prevents the model from having to “waste” a routing slot on general-purpose knowledge.
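One way to picture the shared-expert variant is that its output is simply added alongside the routed mixture. This is a simplified sketch, not DeepSeek's or Kimi's actual layer; dimensions and expert counts are invented for the demo.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_with_shared(token, shared_expert, routed_experts, router_weights, k=2):
    """Shared expert runs for every token; top-k routed experts add on top."""
    scores = router_weights @ token
    top_k = np.argsort(scores)[-k:]
    weights = softmax(scores[top_k])
    routed = sum(w * routed_experts[i](token) for w, i in zip(weights, top_k))
    # The always-on shared expert captures general-purpose patterns,
    # so no routing slot is spent on them.
    return shared_expert(token) + routed

# Toy demo: 1 shared + 8 routed experts on an 8-dim token.
rng = np.random.default_rng(1)
d = 8
mats = [rng.normal(size=(d, d)) for _ in range(9)]
shared = lambda x, W=mats[0]: W @ x
routed = [lambda x, W=W: W @ x for W in mats[1:]]
out = moe_with_shared(rng.normal(size=d), shared, routed,
                      rng.normal(size=(8, d)), k=2)
print(out.shape)  # (8,)
```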

The Load Balancing Problem

If the router were left entirely to its own devices, it would often converge on sending most tokens to just a few experts. This is called expert collapse: a handful of experts get very good because they see most of the data, while the rest are undertrained and useless. The model effectively becomes a smaller dense model, wasting all the capacity in the unused experts.

The original MoE paper (Shazeer et al., 2017) addressed this with an auxiliary loss: a penalty term added to the training objective that encourages the router to distribute tokens roughly evenly across experts. If the distribution becomes too unbalanced, the loss increases, pushing the router toward more even allocation.

This works, but the auxiliary loss is a blunt instrument. Setting it too high forces artificial balance at the expense of quality (tokens get sent to suboptimal experts just to meet the quota). Setting it too low does not prevent collapse.
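A common formulation of this auxiliary loss (the Switch Transformer version; the notation here is mine) multiplies, for each expert, the fraction of tokens it received by the mean router probability it was assigned:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, num_experts):
    """Switch-Transformer-style auxiliary load-balancing loss.

    router_probs:      (num_tokens, num_experts) softmax outputs of the router
    expert_assignment: (num_tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # P_i: mean router probability mass assigned to expert i
    P = router_probs.mean(axis=0)
    # Perfectly uniform routing gives the minimum value of 1.0;
    # any imbalance pushes the value higher.
    return num_experts * float(f @ P)

# Balanced toy case: 4 tokens spread evenly over 4 experts.
probs = np.full((4, 4), 0.25)
assign = np.array([0, 1, 2, 3])
print(load_balance_loss(probs, assign, 4))  # 1.0
```

This term is added (scaled by a small coefficient) to the main training objective; tuning that coefficient is exactly the blunt-instrument problem described above.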

DeepSeek V3 introduced an auxiliary-loss-free approach: instead of penalizing imbalance through the loss function, it adds learned bias terms to the router’s affinity scores during top-k selection. These biases gently steer the distribution without interfering with the actual training signal. DeepSeek reports better performance with this method than with auxiliary losses.
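The key trick can be sketched as follows. This is a simplification of DeepSeek's method: the rule for updating the bias terms (nudging down overloaded experts between steps) is omitted, and the numbers are invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def biased_top_k(scores, bias, k=2):
    """Aux-loss-free balancing sketch: bias shifts selection, not weighting.

    scores: router affinity scores for one token, shape (num_experts,)
    bias:   per-expert bias, steered down for overloaded experts
    """
    # Expert selection uses the biased scores...
    chosen = np.argsort(scores + bias)[-k:]
    # ...but the mixing weights come from the original, unbiased scores,
    # so the gradient signal through the router is untouched.
    return chosen, softmax(scores[chosen])

# Expert 0 scores highest but is overloaded, so its bias steers tokens away.
scores = np.array([3.0, 2.0, 1.0, 0.0])
bias = np.array([-5.0, 0.0, 0.0, 0.0])
chosen, weights = biased_top_k(scores, bias, k=2)
print(sorted(chosen.tolist()))  # [1, 2]
```

Because the bias only participates in the argmax-style selection, it never appears in the loss, which is the sense in which the method is "auxiliary-loss-free."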

What Experts Actually Learn

Experts do not neatly map to human-understandable categories like “the code expert” or “the math expert.” The specialization is more granular and statistical. Research on expert activation patterns shows that experts tend to specialize around syntactic and semantic clusters: certain experts activate more for punctuation, others for domain-specific vocabulary, others for particular syntactic structures.

This means the routing is not a high-level “topic detector.” It operates at the token level, making fine-grained decisions about which expert’s weights are most useful for the specific context around each token.

Why MoE Matters for Costs

If you are building applications on LLM APIs, MoE architecture directly affects your costs. Two models with similar benchmark scores might have very different pricing because one is dense and the other is MoE.

DeepSeek V3 (671B total, 37B active) and Llama 3.1 405B (dense, all 405B active) are in a similar performance tier. But DeepSeek V3 activates roughly 10x fewer parameters per token. This translates to lower inference costs and faster generation, which is why DeepSeek’s API pricing is significantly lower than comparable dense models.

The same logic applies to Cursor’s Composer 2, which is built on Kimi K2.5 (1T total, 32B active). Despite being built on a trillion-parameter model, Cursor prices it at $0.50 per million input tokens, competitive with much smaller models, because only 32 billion parameters run per token.

When evaluating models, the active parameter count is a better proxy for cost and speed than total parameter count.

Why MoE Matters for Self-Hosting

If you run models locally, MoE introduces a specific constraint: memory. Even though only 32B parameters activate per token, all 1T parameters need to be loaded into memory (or at least accessible from disk), because the router decides at runtime which experts to use. You cannot predict in advance which experts a given input will need.

For Kimi K2.5 at FP16 precision, that means roughly 2TB of VRAM just for the weights. This is why very large MoE models are primarily served from cloud infrastructure with many GPUs, not from local machines.

Smaller MoE models like Mixtral 8x7B (46.7B total) are more practical for local deployment. At 4-bit quantization, Mixtral fits in roughly 26GB of VRAM, which is within reach of high-end consumer GPUs.
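The arithmetic behind those figures is straightforward (raw weight storage only; runtime overhead such as the KV cache and activations comes on top):

```python
def weight_memory_gb(total_params, bits_per_param):
    """Raw weight storage in GB; excludes KV cache and activation memory."""
    return total_params * bits_per_param / 8 / 1e9

# Kimi K2.5 at FP16: all 1T params must be resident, though only 32B activate.
print(f"{weight_memory_gb(1e12, 16):,.0f} GB")   # 2,000 GB, i.e. ~2 TB
# Mixtral 8x7B at 4-bit quantization: weights alone, before overhead.
print(f"{weight_memory_gb(46.7e9, 4):,.1f} GB")  # ~23 GB
```

The quantization overhead (scales, activations, context) is what pushes the practical Mixtral figure toward the ~26GB cited above.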

The History

The MoE concept predates modern LLMs by decades. Jacobs et al. proposed the original Mixture of Experts framework in 1991. But the modern application to transformers and LLMs started with the 2017 paper “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” by Noam Shazeer and colleagues at Google, which demonstrated MoE layers scaling to 137 billion parameters with 1000x improvements in model capacity for minimal compute overhead.

The Switch Transformer (2022) simplified routing to top-1 and showed MoE could scale to over a trillion parameters. Google’s Gemini series adopted MoE starting with Gemini 1.5 Pro (2024), and the architecture continues through Gemini 2.5 Pro and the latest Gemini 3.1 Pro (February 2026). Mixtral 8x7B (December 2023) proved that open-source MoE models could match much larger dense models. And DeepSeek V3 and Kimi K2.5 (both released in the past year) pushed the scale to 671B and 1T total parameters respectively, with innovations in load balancing and training efficiency.

MoE is now the default architecture for cost-efficient frontier models. Understanding it helps you make better decisions about which models to use, how to estimate costs, and what the tradeoffs are between model size, speed, and capability.

Key Takeaways

  • Total parameters are not active parameters. A 1T MoE model that activates 32B per token costs roughly the same as a 32B dense model to run.
  • The router is learned, not programmed. It discovers which experts to use during training, operating at the token level.
  • Load balancing is a core training challenge. Without it, experts collapse. Modern approaches (auxiliary loss, bias terms) address this differently.
  • Memory requirements are based on total parameters. You need to store all the experts even if most are idle per token.
  • For API users, MoE means better price-performance. Models like DeepSeek V3 and Kimi K2.5 deliver frontier capability at lower cost because they activate fewer parameters.
Get Insanely Good at AI

The book for developers who want to understand how AI actually works. LLMs, prompt engineering, RAG, AI agents, and production systems.
