Chunking
Chunking is the process of splitting long documents into smaller meaning-bearing units (chunks) that LLMs and vector databases can process. It's a mandatory preprocessing step in RAG pipelines before web pages, PDFs, or docs are embedded — and each chunk becomes the minimum unit an AI can cite in its answer.
Why It Matters
When AI search generates an answer, it cites the most relevant chunk — not the whole page. Two versions of the same blog post can produce completely different AI quotes depending on how they're chunked. Anthropic and OpenAI engineering blogs report that well-tuned chunking improves RAG retrieval accuracy by 30–50% over baseline. This is where the GEO principle "write in chunks" comes from.
Main Chunking Strategies
Fixed-size chunking: Splits text at a fixed token count (e.g., 500 or 1,000 tokens). Simple, but can break mid-sentence and lose context.
Recursive (sentence/paragraph): Splits paragraphs, then sentences, then words — preserving natural boundaries. The default in most RAG pipelines.
Semantic chunking: Uses embedding similarity to detect topic shifts and split there. Highest quality but computationally expensive.
Document-aware chunking: Uses document structure — Markdown headings (e.g., ###) or HTML heading tags — as chunk boundaries. Most effective for structured content like blog posts.
Overlap: Duplicates 10–20% of content across adjacent chunks so context doesn't get lost at the seam.
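The first and last strategies above can be combined in a few lines. This is a minimal sketch, not a production tokenizer: it splits on whitespace as a stand-in for real token counting, with the chunk size and overlap as illustrative parameters.

```python
def chunk_fixed(text, chunk_size=500, overlap=100):
    """Split text into fixed-size chunks of `chunk_size` words,
    duplicating the last `overlap` words of each chunk at the
    start of the next so context survives the seam."""
    words = text.split()
    step = chunk_size - overlap  # advance less than a full chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the text
    return chunks
```

With a 100-word overlap, the tail of each chunk reappears at the head of the next, which is the 10–20% duplication described above.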
Implications for GEO Writing
Sections must stand alone: Chunks typically correspond to ### sections. If a section depends on the previous one to make sense, it breaks when cited in isolation.
Include the subject and context inside each section: Write "inblog handles…" not "this tool handles…" — each paragraph should be self-contained.
Right length: Very short sections lack enough information to be worth citing; very long sections dilute their embedding meaning. 200–500 words is the sweet spot.
Headings at topic shifts: If a single section mixes topics, chunkers split in awkward places. Add a clear ### heading whenever the topic changes.
FAQ blocks: Q&A pairs naturally form self-contained chunks, so breaking key questions into an FAQ section dramatically raises citation probability.
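The heading-at-topic-shift advice maps directly onto document-aware chunking: a chunker that splits at heading boundaries will produce exactly the self-contained sections described above. A minimal sketch, assuming Markdown input and ##/### headings as boundaries:

```python
import re

def chunk_by_headings(markdown):
    """Split Markdown text at ## / ### heading boundaries,
    keeping each heading attached to the body that follows it."""
    # Lookahead split: the heading line stays with its own section.
    parts = re.split(r"(?m)^(?=#{2,3} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk begins with its heading, so the subject and context travel with the section when it is cited in isolation.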