
Mixture of Experts (MoE)

Mixture of Experts (MoE) is a neural network architecture in which an LLM contains many specialized "expert" sub-networks and, for each input token, a gating mechanism activates only a small subset (typically 2 of 8, or 8 of 256) while leaving the rest idle. The model gets the capacity of its huge total parameter count while paying per-token inference compute closer to that of a much smaller model.

Why It Matters

MoE is how modern LLMs keep getting smarter without exploding inference cost. Mixtral 8×7B, DeepSeek-V3, Grok-2, and reportedly GPT-4 all use MoE to add capacity cheaply. A dense 400B-parameter model must run all 400B weights on every token. An MoE model with 400B total parameters but only 40B active per token needs roughly 10× less compute per token, while retaining most of the knowledge the extra parameters encode. For builders, MoE changes which frontier models are actually affordable to serve and which open-source options are feasible at scale.
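The compute arithmetic above can be sketched directly. The 2-FLOPs-per-active-parameter figure is a common rough rule of thumb, and the parameter counts are the illustrative numbers from the text, not measured benchmarks:

```python
# Hypothetical per-token compute comparison: a dense model vs. an MoE
# model with the same total parameter count (illustrative numbers).

def flops_per_token(active_params: float) -> float:
    # Rough rule of thumb: ~2 FLOPs per active parameter per token
    # for a forward pass.
    return 2 * active_params

dense_total = 400e9   # dense model: all 400B weights run on every token
moe_active = 40e9     # MoE model: 400B total, but only 40B active per token

speedup = flops_per_token(dense_total) / flops_per_token(moe_active)
print(f"approximate per-token compute ratio: {speedup:.0f}x")  # → 10x
```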

How It Works

Experts: Inside each transformer block, the feed-forward layer is replaced with N parallel feed-forward networks ("experts"). Typical: 8, 16, 64, or 256 experts per layer.

Gating network: A small learned router decides which top-k experts get to process each token. k is usually 1 or 2.

Sparse activation: Only the selected experts run their weights for that token. The rest don't contribute, so compute scales with k × expert size, not total expert count.

Load balancing: A training-time loss encourages the router to distribute tokens evenly so no expert starves. Otherwise the model collapses to using a few experts and wastes the rest.

Aggregation: The outputs of the selected experts are weighted by gate scores and summed.
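The steps above can be sketched in miniature. This toy uses simple scaling functions as "experts" and a hand-supplied gate score vector in place of a learned router, and it omits the load-balancing loss; it is an illustration of top-k routing and weighted aggregation, not a framework implementation:

```python
import math
import random

NUM_EXPERTS = 8
TOP_K = 2

# Toy experts: each just scales its input vector by a fixed factor.
# (Real experts are feed-forward networks; the default argument f=i+1
# binds each expert's factor at definition time.)
experts = [lambda x, f=i + 1: [f * v for v in x] for i in range(NUM_EXPERTS)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, gate_scores):
    # 1. Pick the top-k experts by gate score. Sparse activation:
    #    the remaining experts never run for this token.
    topk = sorted(range(NUM_EXPERTS),
                  key=lambda i: gate_scores[i], reverse=True)[:TOP_K]
    # 2. Renormalize the selected scores so the weights sum to 1.
    weights = softmax([gate_scores[i] for i in topk])
    # 3. Aggregation: run only the selected experts and sum their
    #    outputs, weighted by the gate.
    out = [0.0] * len(token)
    for w, i in zip(weights, topk):
        for d, v in enumerate(experts[i](token)):
            out[d] += w * v
    return out, topk

random.seed(0)
token = [1.0, 2.0]
gate_scores = [random.random() for _ in range(NUM_EXPERTS)]
output, chosen = moe_layer(token, gate_scores)
print(chosen)  # indices of the 2 experts that actually ran
```

Note that compute scales with TOP_K, not NUM_EXPERTS: only two expert calls happen per token regardless of how many experts exist.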

Total vs Active Parameters

Every MoE spec has two numbers:

  • Total parameters: The full model weight count (determines memory).
  • Active parameters: Per-token compute (determines inference cost).

Example: Mixtral 8×7B has ~47B total parameters but only ~13B active per token. DeepSeek-V3 has 671B total and 37B active. The gap is where MoE's magic lives.
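The Mixtral numbers can be sanity-checked with a back-of-the-envelope split. The division into shared (attention, embeddings) and per-expert weights below is an assumption chosen to match the published ~47B/~13B totals, not an official breakdown:

```python
# Back-of-the-envelope for Mixtral 8x7B-style numbers. The shared/expert
# split is assumed for illustration; only the totals come from the text.

num_experts = 8
active_experts = 2          # top-k routing with k = 2
expert_params = 5.67e9      # assumed params per expert (all layers combined)
shared_params = 1.67e9      # assumed shared attention/embedding params

total = shared_params + num_experts * expert_params      # memory footprint
active = shared_params + active_experts * expert_params  # per-token compute

print(f"total:  ~{total / 1e9:.0f}B")   # ~47B
print(f"active: ~{active / 1e9:.0f}B")  # ~13B
```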

Why It Works

Different experts specialize implicitly during training. One expert might become the "code" expert, another the "math" expert, another the "European languages" expert. The router learns to send the right tokens to the right experts. This is similar in spirit to how humans use different brain regions for different tasks — efficient routing of signal, not a monolithic process.

Trade-offs

Memory: Even though only some experts run per token, all experts sit in VRAM. A 671B MoE still needs enough GPU memory for 671B parameters.

Serving complexity: Routing tokens to specific experts is harder to parallelize than dense inference. Specialized inference engines (vLLM, TensorRT-LLM, DeepSpeed) are usually required.

Training instability: Load balancing, expert collapse, and router noise make MoE training trickier than dense training.

Communication overhead: In distributed training, token-to-expert routing requires all-to-all GPU communication. Networking becomes a bottleneck.

Fine-tuning difficulty: MoE models are harder to fine-tune effectively — router dynamics drift with new data.
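The memory trade-off above is easy to quantify, since every parameter must be resident regardless of routing. A rough estimate, using DeepSeek-V3-scale numbers and ignoring activation memory and KV cache:

```python
# Rough weight-memory estimate for an MoE model: all experts sit in
# VRAM even though only a few run per token. Illustrative numbers only.

def weight_memory_gb(total_params: float, bytes_per_param: int) -> float:
    return total_params * bytes_per_param / 1e9

total_params = 671e9  # DeepSeek-V3-scale total parameter count

print(f"bf16: {weight_memory_gb(total_params, 2):,.0f} GB")  # ≈ 1342 GB
print(f"int8: {weight_memory_gb(total_params, 1):,.0f} GB")  # ≈ 671 GB
```

Per-token compute tracks the 37B active parameters, but the hardware budget tracks all 671B.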

MoE vs Dense

Aspect              | Dense                        | MoE
Per-token compute   | All parameters               | k of N experts
Memory footprint    | Small for size               | Large for size
Inference cost      | Proportional to total params | Proportional to active params
Training difficulty | Standard                     | Harder (balancing, routing)
Specialization      | Implicit in layers           | Explicit in experts

Rule of thumb: MoE wins on cost per token and compute utilization; dense wins on memory efficiency and fine-tuning ergonomics.

Notable MoE Models

  • Mixtral 8×7B (Mistral, 2023): First widely used open-weight MoE LLM. ~47B total, ~13B active.
  • Mixtral 8×22B (Mistral, 2024): Larger variant.
  • DeepSeek-V3 / V3.1 / R1 (DeepSeek, 2024–2025): 671B total, 37B active. Extreme MoE with 256 experts per layer, 8 active.
  • Grok-2 (xAI, 2024): MoE architecture.
  • GPT-4 and Claude Opus: Widely believed to use MoE internally (not officially confirmed for all).
