
Mixture of Experts (MoE)

Mixture of Experts (MoE) is a neural network architecture in which an LLM contains many specialized "expert" sub-networks and, for each input token, a gating mechanism activates only a small subset (typically 2 of 8, or 8 of 256) while leaving the rest idle. The model gets the capacity of its huge total parameter count while paying per-token inference compute closer to that of a much smaller model.

Why It Matters

MoE is how modern LLMs keep getting smarter without exploding inference cost. Mixtral 8×7B, DeepSeek-V3, Grok-2, and reportedly GPT-4 all use MoE to add capacity cheaply. A dense 400B-parameter model must run all 400B weights on every token. An MoE model with 400B total parameters but only 40B active per token needs roughly 10× less compute per token, while retaining most of the knowledge the extra parameters encode. For builders, MoE changes which frontier models are actually affordable to serve and which open-source options are feasible at scale.
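The compute arithmetic above can be sketched directly. The 2-FLOPs-per-active-parameter figure is a common rough rule of thumb, and the parameter counts are the illustrative numbers from the text, not measured benchmarks:

```python
# Hypothetical per-token compute comparison: a dense model vs. an MoE
# model with the same total parameter count (illustrative numbers).

def flops_per_token(active_params: float) -> float:
    # Rough rule of thumb: ~2 FLOPs per active parameter per token
    # for a forward pass.
    return 2 * active_params

dense_total = 400e9   # dense model: all 400B weights run on every token
moe_active = 40e9     # MoE model: 400B total, but only 40B active per token

speedup = flops_per_token(dense_total) / flops_per_token(moe_active)
print(f"approximate per-token compute ratio: {speedup:.0f}x")  # → 10x
```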

How It Works

Experts: Inside each transformer block, the feed-forward layer is replaced with N parallel feed-forward networks ("experts"). Typical: 8, 16, 64, or 256 experts per layer.

Gating network: A small learned router decides which top-k experts get to process each token. k is usually 1 or 2.

Sparse activation: Only the selected experts run their weights for that token. The rest don't contribute, so compute scales with k × expert size, not total expert count.

Load balancing: A training-time loss encourages the router to distribute tokens evenly so no expert starves. Otherwise the model collapses to using a few experts and wastes the rest.

Aggregation: The outputs of the selected experts are weighted by gate scores and summed.
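The steps above can be sketched in miniature. This toy uses simple scaling functions as "experts" and a hand-supplied gate score vector in place of a learned router, and it omits the load-balancing loss; it is an illustration of top-k routing and weighted aggregation, not a framework implementation:

```python
import math
import random

NUM_EXPERTS = 8
TOP_K = 2

# Toy experts: each just scales its input vector by a fixed factor.
# (Real experts are feed-forward networks; the default argument f=i+1
# binds each expert's factor at definition time.)
experts = [lambda x, f=i + 1: [f * v for v in x] for i in range(NUM_EXPERTS)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, gate_scores):
    # 1. Pick the top-k experts by gate score. Sparse activation:
    #    the remaining experts never run for this token.
    topk = sorted(range(NUM_EXPERTS),
                  key=lambda i: gate_scores[i], reverse=True)[:TOP_K]
    # 2. Renormalize the selected scores so the weights sum to 1.
    weights = softmax([gate_scores[i] for i in topk])
    # 3. Aggregation: run only the selected experts and sum their
    #    outputs, weighted by the gate.
    out = [0.0] * len(token)
    for w, i in zip(weights, topk):
        for d, v in enumerate(experts[i](token)):
            out[d] += w * v
    return out, topk

random.seed(0)
token = [1.0, 2.0]
gate_scores = [random.random() for _ in range(NUM_EXPERTS)]
output, chosen = moe_layer(token, gate_scores)
print(chosen)  # indices of the 2 experts that actually ran
```

Note that compute scales with TOP_K, not NUM_EXPERTS: only two expert calls happen per token regardless of how many experts exist.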

Total vs Active Parameters

Every MoE spec has two numbers:

  • Total parameters: The full model weight count (determines memory).
  • Active parameters: Per-token compute (determines inference cost).

Example: Mixtral 8×7B has ~47B total parameters but only ~13B active per token. DeepSeek-V3 has 671B total and 37B active. The gap is where MoE's magic lives.
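The Mixtral numbers can be sanity-checked with a back-of-the-envelope split. The division into shared (attention, embeddings) and per-expert weights below is an assumption chosen to match the published ~47B/~13B totals, not an official breakdown:

```python
# Back-of-the-envelope for Mixtral 8x7B-style numbers. The shared/expert
# split is assumed for illustration; only the totals come from the text.

num_experts = 8
active_experts = 2          # top-k routing with k = 2
expert_params = 5.67e9      # assumed params per expert (all layers combined)
shared_params = 1.67e9      # assumed shared attention/embedding params

total = shared_params + num_experts * expert_params      # memory footprint
active = shared_params + active_experts * expert_params  # per-token compute

print(f"total:  ~{total / 1e9:.0f}B")   # ~47B
print(f"active: ~{active / 1e9:.0f}B")  # ~13B
```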

Why It Works

Different experts specialize implicitly during training. One expert might become the "code" expert, another the "math" expert, another the "European languages" expert. The router learns to send the right tokens to the right experts. This is similar in spirit to how humans use different brain regions for different tasks — efficient routing of signal, not a monolithic process.

Trade-offs

Memory: Even though only some experts run per token, all experts sit in VRAM. A 671B MoE still needs enough GPU memory for 671B parameters.

Serving complexity: Routing tokens to specific experts is harder to parallelize than dense inference. Specialized inference engines (vLLM, TensorRT-LLM, DeepSpeed) are usually required.

Training instability: Load balancing, expert collapse, and router noise make MoE training trickier than dense training.

Communication overhead: In distributed training, token-to-expert routing requires all-to-all GPU communication. Networking becomes a bottleneck.

Fine-tuning difficulty: MoE models are harder to fine-tune effectively — router dynamics drift with new data.
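The memory trade-off above is easy to quantify, since every parameter must be resident regardless of routing. A rough estimate, using DeepSeek-V3-scale numbers and ignoring activation memory and KV cache:

```python
# Rough weight-memory estimate for an MoE model: all experts sit in
# VRAM even though only a few run per token. Illustrative numbers only.

def weight_memory_gb(total_params: float, bytes_per_param: int) -> float:
    return total_params * bytes_per_param / 1e9

total_params = 671e9  # DeepSeek-V3-scale total parameter count

print(f"bf16: {weight_memory_gb(total_params, 2):,.0f} GB")  # ≈ 1342 GB
print(f"int8: {weight_memory_gb(total_params, 1):,.0f} GB")  # ≈ 671 GB
```

Per-token compute tracks the 37B active parameters, but the hardware budget tracks all 671B.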

MoE vs Dense

Aspect              | Dense                        | MoE
Per-token compute   | All parameters               | k of N experts
Memory footprint    | Small for size               | Large for size
Inference cost      | Proportional to total params | Proportional to active params
Training difficulty | Standard                     | Harder (balancing, routing)
Specialization      | Implicit in layers           | Explicit in experts

Rule of thumb: MoE wins on cost per token and compute utilization; dense wins on memory efficiency and fine-tuning ergonomics.

Notable MoE Models

  • Mixtral 8×7B (Mistral, 2023): First widely used open-weight MoE LLM. ~47B total, ~13B active.
  • Mixtral 8×22B (Mistral, 2024): Larger variant.
  • DeepSeek-V3 / V3.1 / R1 (DeepSeek, 2024–2025): 671B total, 37B active. Extreme MoE with 256 experts per layer, 8 active.
  • Grok-2 (xAI, 2024): MoE architecture.
  • GPT-4 and Claude Opus: Widely believed to use MoE internally (not officially confirmed for all).
