Query Decomposition
Query decomposition is a RAG technique that splits a complex, multi-part user question into several simpler sub-questions, retrieves context for each, then composes a final answer. Instead of asking the retriever to find one passage that answers everything at once, the system asks many narrow questions in parallel.
Why It Matters
Real users ask messy questions: "What's the difference between LCP and FCP, and which one matters more for mobile SEO in 2026?" Hand that query to a vector retriever and it returns passages about LCP, or FCP, or mobile SEO, or 2026 trends — rarely a single passage that covers all four. Query decomposition splits the question into sub-queries ("What is LCP?", "What is FCP?", "LCP vs FCP", "Mobile SEO Core Web Vitals 2026"), retrieves separately for each, and lets the model stitch the final answer together from rich context. Production RAG systems at Perplexity, Glean, and Anthropic use some form of decomposition for complex questions, and LangChain's 2024 benchmarks show 15–25% accuracy gains on multi-hop QA.
How It Works
1. Decomposer LLM call: A small model takes the user query and outputs 2–5 sub-questions. Prompt: "Break this question into the minimum sub-questions needed to answer it fully."
2. Parallel retrieval: Each sub-question runs through the retriever — vector, hybrid, or keyword — independently.
3. Context aggregation: The retrieved passages from all sub-questions are combined into a single context block.
4. Final answer generation: The main model sees the original question plus all retrieved context and writes a unified answer.
5. Optional synthesis step: For multi-hop questions, an intermediate step composes partial answers before final generation.
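The steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `decompose` and `retrieve` are stubs standing in for a real LLM call and a real vector/hybrid index, and the final prompt is returned instead of being sent to a generator model.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(question: str) -> list[str]:
    """Stub decomposer: a real system would prompt a small LLM here."""
    return ["What is LCP?", "What is FCP?", "LCP vs FCP"]

def retrieve(sub_question: str) -> list[str]:
    """Stub retriever: a real system would query a vector/hybrid index."""
    return [f"[passage for: {sub_question}]"]

def build_final_prompt(question: str) -> str:
    sub_questions = decompose(question)           # step 1: decomposer call
    with ThreadPoolExecutor() as pool:            # step 2: parallel retrieval
        results = pool.map(retrieve, sub_questions)
    seen, context = set(), []                     # step 3: aggregate + dedupe
    for passages in results:
        for p in passages:
            if p not in seen:
                seen.add(p)
                context.append(p)
    # Step 4: the final prompt pairs the ORIGINAL question with all context.
    return f"Question: {question}\n\nContext:\n" + "\n".join(context)

print(build_final_prompt("What's the difference between LCP and FCP?"))
```

Note that the generator sees the original question, not the sub-questions — the sub-questions exist only to drive retrieval.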
Variants
Parallel decomposition: All sub-questions run simultaneously. Fast, good for questions where parts are independent.
Sequential decomposition (multi-hop): Later sub-questions depend on earlier answers. "Who is the CEO of inblog's biggest competitor?" needs to answer "Who is inblog's biggest competitor?" first, then look up that company's CEO.
Step-back prompting: Before decomposing, the LLM asks a more abstract version of the question to pull in broader context. Introduced by Google DeepMind researchers in 2023.
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, embed that, and retrieve — a lightweight alternative to explicit decomposition.
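The difference between parallel and sequential decomposition shows up clearly in code: in the sequential case, the second sub-question is templated from the first hop's answer, so the hops cannot run concurrently. A minimal sketch, where `lookup` stands in for a full retrieve-then-answer step and the entity names and answers are made up for illustration:

```python
# Hypothetical mini knowledge base; "ExampleCo" and "Jane Doe" are
# invented placeholders, not real answers.
FACTS = {
    "Who is inblog's biggest competitor?": "ExampleCo",
    "Who is the CEO of ExampleCo?": "Jane Doe",
}

def lookup(question: str) -> str:
    """Stub for a retrieve-then-answer step against a real index."""
    return FACTS.get(question, "unknown")

def multi_hop(question: str) -> str:
    # Hop 1: resolve the inner entity first.
    competitor = lookup("Who is inblog's biggest competitor?")
    # Hop 2: templated from hop 1's answer — this dependency is why
    # running both hops in parallel would fail.
    return lookup(f"Who is the CEO of {competitor}?")

print(multi_hop("Who is the CEO of inblog's biggest competitor?"))
```

A sequential decomposer typically emits hop templates like "Who is the CEO of {answer_1}?" and fills them in as earlier hops resolve.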
When to Use It
Comparison questions: "X vs Y," "Which is better for Z"
Multi-hop reasoning: "Who founded the company that acquired Figma?"
Compound questions: "How and why" combined in one query.
Long-tail specificity: Rare questions where no single source page exists, but multiple pages each cover part.
Questions mixing concepts: "Technical SEO for SaaS blogs in Korean"
When Not To Use It
Simple single-fact questions: "What's the capital of France?" doesn't need decomposition — it adds latency and cost.
Budget-constrained applications: Decomposition multiplies retriever calls. For high-volume chat, the cost hit is real.
Domains with strong single-document answers: Legal contracts, product manuals — one good passage beats five mediocre ones.
Trade-offs
Latency: Every sub-question is a round trip. Parallel execution helps but doesn't eliminate it.
Retriever cost: Vector search calls scale linearly with sub-questions.
Decomposer quality: Bad decomposition produces bad retrievals. The decomposer prompt and model matter as much as the final generator.
Redundant retrieval: Sub-questions often overlap, pulling the same passages repeatedly. Deduplication helps.
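The redundant-retrieval problem has a cheap fix: deduplicate aggregated passages by a stable id before building the final prompt. A minimal sketch, assuming each retrieved passage carries an `id` field:

```python
def dedupe(passages: list[dict]) -> list[dict]:
    """Drop repeated passages by id, keeping first-seen order, so
    overlapping sub-questions don't pad the prompt with duplicates."""
    seen: set[str] = set()
    unique = []
    for p in passages:
        if p["id"] not in seen:
            seen.add(p["id"])
            unique.append(p)
    return unique

hits = [
    {"id": "a", "text": "LCP measures largest contentful paint..."},
    {"id": "b", "text": "FCP measures first contentful paint..."},
    {"id": "a", "text": "LCP measures largest contentful paint..."},
]
print([p["id"] for p in dedupe(hits)])  # → ['a', 'b']
```

Deduplicating by id rather than by exact text also catches the common case where the same chunk is returned with slightly different scores or metadata.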
Common Mistakes
Over-decomposing: Breaking a simple question into 10 sub-questions wastes tokens and confuses the final model.
Decomposing without grounding: Passing sub-answers through instead of source passages lets hallucinations compound across hops.
Ignoring dependencies: Running a multi-hop question in parallel when the second step depends on the first gives wrong answers.
No evaluation: Without a benchmark, you can't tell if decomposition actually helped versus the baseline single-shot RAG.