Chunking
Chunking is the process of splitting long documents into smaller meaning-bearing units (chunks) that LLMs and vector databases can process. It's a mandatory preprocessing step in RAG pipelines before web pages, PDFs, or docs are embedded — and each chunk becomes the minimum unit an AI can cite in its answer.
Why It Matters
When AI search generates an answer, it cites the most relevant chunk — not the whole page. Two versions of the same blog post can produce completely different AI quotes depending on how they're chunked. Anthropic and OpenAI engineering blogs report that well-tuned chunking improves RAG retrieval accuracy by 30–50% over baseline. This is where the GEO principle "write in chunks" comes from.
Main Chunking Strategies
Fixed-size chunking: Splits text at a fixed token count (e.g., 500 or 1,000 tokens). Simple, but can break mid-sentence and lose context.
Recursive (sentence/paragraph): Splits paragraphs, then sentences, then words — preserving natural boundaries. The default in most RAG pipelines.
Semantic chunking: Uses embedding similarity to detect topic shifts and split there. Highest quality but computationally expensive.
Document-aware chunking: Uses document structure — Markdown headings (e.g., ###) or HTML heading tags — as chunk boundaries. Most effective for structured content like blog posts.
Overlap: Duplicates 10–20% of content across adjacent chunks so context doesn't get lost at the seam.
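The first and last strategies above can be combined in a few lines. This is a minimal sketch, not a production tokenizer: it splits on whitespace as a stand-in for real token counting, with the chunk size and overlap as illustrative parameters.

```python
def chunk_fixed(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks of `chunk_size` words,
    duplicating the last `overlap` words of each chunk at the
    start of the next so context survives the seam."""
    words = text.split()
    step = chunk_size - overlap  # advance less than a full chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

With a 100-word overlap, the tail of each chunk reappears at the head of the next, which is the 10–20% duplication described above.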
Implications for GEO Writing
Sections must stand alone: Chunks typically correspond to ### sections. If a section depends on the previous one to make sense, it breaks when cited in isolation.
Include the subject and context inside each section: Write "inblog handles…" not "this tool handles…" — each paragraph should be self-contained.
Right length: Very short sections lack enough information to be worth citing; very long sections dilute their embedding meaning. 200–500 words is the sweet spot.
Headings at topic shifts: If a single section mixes topics, chunkers split in awkward places. Add a clear ### heading whenever the topic changes.
FAQ blocks: Q&A pairs naturally form self-contained chunks, so breaking key questions into an FAQ section dramatically raises citation probability.
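The heading-at-topic-shift advice maps directly onto document-aware chunking: a chunker that splits at heading boundaries will produce exactly the self-contained sections described above. A minimal sketch, assuming Markdown input and ##/### headings as boundaries:

```python
import re

def chunk_by_headings(markdown):
    """Split Markdown text at ## / ### heading boundaries,
    keeping each heading attached to the body that follows it."""
    # Lookahead split: the heading line stays with its own section.
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk begins with its heading, so the subject and context travel with the section when it is cited in isolation.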