Tokenization

Tokenization is the process of splitting natural-language text into the minimum units — "tokens" — that an LLM actually processes. Every LLM input, output, billing charge, and context window limit is measured in tokens, not words.

Why It Matters

Tokens are the base currency of LLMs. OpenAI, Anthropic, and Google all bill API usage per token, and context windows are defined by token counts. The same piece of content can cost 2–3x more tokens depending on language and text structure, so understanding tokenization has direct cost and performance implications for GEO, content strategy, and AI app development.

How Tokenization Works

Most modern LLMs use Byte Pair Encoding (BPE), typically via implementations such as SentencePiece or tiktoken.

  1. The tokenizer builds a vocabulary by merging frequent character combinations found in training data.
  2. Input text is segmented against this vocabulary greedily, matching the longest vocabulary entry at each position.
  3. Common English words become a single token; rare words and non-English text get split into multiple tokens.

English example: "tokenization" → ["token", "ization"] (2 tokens)
Korean example: "토큰화" → ["토", "큰", "화"] or finer UTF-8 byte splits, typically 6–9 tokens
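The longest-match step above can be sketched in a few lines of Python. The toy vocabulary and `tokenize` helper here are illustrative only, not any real model's tokenizer:

```python
# Toy longest-match tokenizer: a minimal sketch of the segmentation step.
# The vocabulary below is illustrative, not taken from any real model.
VOCAB = {"token", "ization", "iz", "ation",
         "t", "o", "k", "e", "n", "i", "z", "a", "s"}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible substring first, shrinking until a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Out-of-vocabulary character: fall back to the raw character.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("tokenization", VOCAB))  # → ['token', 'ization'], 2 tokens
```

A common word like "tokenization" is covered by two long vocabulary entries, while a string the vocabulary has never merged would fall through to the character-by-character (or, in real tokenizers, byte-by-byte) fallback.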

Quirks of Non-English Tokenization

English averages ~1.3 tokens per word, but languages like Korean, Japanese, or Thai can use 1.5–2 tokens per character. Two reasons:

Training data mix: Major LLM training corpora are 1–3% Korean, meaning few dedicated Korean tokens enter the vocabulary.

Unicode fallback: Out-of-vocabulary characters fall back to UTF-8 byte-level splitting, so a single character can become 2–3 tokens.
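The byte-level fallback is visible from UTF-8 itself: each Hangul syllable encodes to 3 bytes, so a byte-level split sees up to 9 units for a 3-character word. A quick sketch (plain Python, no specific model's tokenizer):

```python
# UTF-8 byte fallback: an out-of-vocabulary character is split into
# its raw UTF-8 bytes. Each Hangul syllable occupies 3 bytes.
word = "토큰화"  # 3 characters

byte_units = [bytes([b]) for b in word.encode("utf-8")]
print(len(word))        # 3 characters
print(len(byte_units))  # 9 byte-level units

english = "tokenization"
print(len(english.encode("utf-8")))  # 12 — ASCII is 1 byte per character
```

This is why a short Korean word can cost several times more tokens than an English word of similar length: the worst case is one token per byte rather than one token per word.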

As a result, a Korean blog post consumes roughly 50% more tokens than its English equivalent — and fits less content into the same context window.

GEO Implications

Information density: Non-English content pays more per token, so tight sentences, clear headings, and compact phrasing improve citation efficiency.

Eliminate redundancy: Restating the same point in different words wastes token budget that could carry new information.

Front-load key information: When the token budget is tight, LLMs prioritize earlier content. Inverted-pyramid writing wins.

Bilingual entity names: Adding English terms alongside local-language proper nouns ("토큰화(Tokenization)") improves matching against English queries.
