Sub-Question Decomposition
- Sub-question decomposition is the process of breaking down a complex query into self-contained, simpler sub-questions that collectively address the original query.
- It employs techniques such as LLM prompting, supervised sequence-to-sequence models, and structural parsing to isolate distinct facets and reasoning steps.
- Applications include multi-hop question answering, retrieval augmentation, and multi-table reasoning, yielding significant empirical performance gains.
Sub-question decomposition is the process of transforming a complex, multi-faceted, or compositional question into a set of simpler sub-questions whose answers collectively address the original information need. The technique has become central to modern machine learning and information retrieval pipelines, particularly for tasks where direct, holistic reasoning is infeasible for current models because of limits on token capacity, data coverage, transparency, or reasoning depth. Such tasks include multi-hop question answering (QA), video-language understanding, fact verification, retrieval-augmented generation (RAG), numerical and multi-table reasoning, and knowledge-based question answering.
1. Formal Definitions and Algorithmic Foundations
Sub-question decomposition formalizes the mapping from a complex query $Q$ to a set (or sequence, or graph) of sub-questions $\{q_1, \dots, q_n\}$, each designed to isolate a distinct facet, reasoning step, or piece of evidence. The process can be captured algebraically as a function
$$D: Q \mapsto \{q_1, q_2, \dots, q_n\},$$
where each $q_i$ is ideally “self-contained” and, collectively, the $q_i$’s are sufficient to answer $Q$ (Huang et al., 9 Oct 2025, Xie et al., 2024, Ammann et al., 1 Jul 2025).
Algorithmic approaches include:
- Prompting LLMs: Using few-shot or tailored templates to invoke a frozen LLM (e.g., GPT-4, GPT-3.5-turbo) to produce sub-questions by direct generation (Huang et al., 9 Oct 2025, Xie et al., 2024, Li et al., 9 Oct 2025).
- Supervised Sequence-to-Sequence Models: Training T5/BART models to autoregressively predict sub-questions given the original question (optionally with additional context or schema) (Guo et al., 2022, Zhu et al., 2023, Chen et al., 2022, Eyal et al., 2023).
- Unsupervised or Semi-supervised Methods: Leveraging denoising/back-translation over large question corpora (e.g., ONUS) (Perez et al., 2020), or mining pseudo-decompositions for either direct model training or as weak supervision.
- Structural Parsing: Mapping to semantic structures (e.g., Abstract Meaning Representation graphs, QPL sequential operators, or explicit trees) whose segmentation yields sub-questions aligned with programmatic or logical reasoning steps (Deng et al., 2022, Huang et al., 2023, Eyal et al., 2023, Gandhi et al., 2022).
- Taxonomies and Typing: Classifying sub-questions into roles—such as “core,” “background,” or “follow-up” in RAG evaluation (Xie et al., 2024); or “literal” vs. “implied” in fact verification (Chen et al., 2022)—to guide downstream processing and evaluation.
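The prompting approach in the list above can be sketched as follows. This is a minimal illustration, not the prompt or parser of any cited system: the template wording, the `decompose` function, and the `fake_llm` stand-in are all assumptions made here so the example runs offline. In practice `llm` would wrap a real model call.

```python
import re
from typing import Callable, List

# Hypothetical zero-shot template; real systems typically add few-shot examples.
PROMPT_TEMPLATE = (
    "Break the question below into simpler, self-contained sub-questions.\n"
    "Return one sub-question per line, numbered 1., 2., ...\n\n"
    "Question: {question}\n"
    "Sub-questions:\n"
)


def decompose(question: str, llm: Callable[[str], str], max_subq: int = 5) -> List[str]:
    """Ask an LLM for sub-questions and parse its list-formatted reply."""
    reply = llm(PROMPT_TEMPLATE.format(question=question))
    subqs: List[str] = []
    for line in reply.splitlines():
        # Strip leading list markers such as "1.", "2)", "-" or "*".
        text = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if text and text not in subqs:  # drop blank lines and exact duplicates
            subqs.append(text)
    return subqs[:max_subq]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption, for offline runs only)."""
    return "1. Who directed Inception?\n2. When was that director born?\n"


if __name__ == "__main__":
    print(decompose("When was the director of Inception born?", fake_llm))
```

Capping the output with `max_subq` is one simple guard against over-decomposition. Whether a cap is appropriate depends on the task.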
2. Integration into Downstream Inference Pipelines
Once decomposed, sub-questions are incorporated into various pipeline architectures:
- Sequential Answering & Aggregation: Each sub-question $q_i$ is answered by a base model (single-hop QA, VLM, retriever, etc.), and the answers are recombined, either directly or via another LLM call that conditions on the original question $Q$ plus the sub-answers (Huang et al., 9 Oct 2025, Ammann et al., 1 Jul 2025, Zhu et al., 2023, Radhakrishnan et al., 2023).
- Parallel or DAG Execution: In the presence of compositional structure (e.g., DAGs in AGQA-Decomp (Gandhi et al., 2022) or QDTrees (Huang et al., 2023)), sub-questions may be answered in parallel or following dependency constraints, with answers passed according to the task’s composition rules.
- Retrieval-Augmentation: RAG and Graph-RAG approaches retrieve evidence not just for the original query $Q$ but for each sub-question $q_i$, assembling complementary document (or triple) sets which are then reranked, merged, and used as input to answer synthesis (Xie et al., 2024, Ammann et al., 1 Jul 2025, Li et al., 9 Oct 2025, Luo et al., 9 Mar 2026).
- Program Synthesis and Execution: In numerical QA and text-to-SQL, sub-questions correspond to intermediate steps (e.g., specific joins, aggregates), forming interpretable programs that are incrementally constructed and executed (Eyal et al., 2023, Luo et al., 9 Mar 2026).
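Dependency-ordered answering, as in the sequential and DAG variants listed above, can be sketched with the standard-library `graphlib` module (Python 3.9+). The `{q1}`-style placeholder convention, the `run_decomposition` function, and the lookup-table answerer are illustrative assumptions, not the interface of any cited system. The sketch also assumes every dependency id appears as a key in `subqs`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_decomposition(
    subqs: Dict[str, str],
    deps: Dict[str, List[str]],
    answer: Callable[[str], str],
) -> Dict[str, str]:
    """Answer sub-questions in dependency order, filling in earlier answers."""
    # Map each node to its predecessors; static_order() yields predecessors first.
    graph = {qid: deps.get(qid, []) for qid in subqs}
    answers: Dict[str, str] = {}
    for qid in TopologicalSorter(graph).static_order():
        text = subqs[qid]
        # Replace placeholders such as "{q1}" with answers already computed.
        for dep in deps.get(qid, []):
            text = text.replace("{" + dep + "}", answers[dep])
        answers[qid] = answer(text)
    return answers


# Toy answerer (assumption): a lookup table standing in for a single-hop QA model.
LOOKUP = {
    "Which country is Lyon in?": "France",
    "What is the capital of France?": "Paris",
}

if __name__ == "__main__":
    subqs = {"q1": "Which country is Lyon in?", "q2": "What is the capital of {q1}?"}
    print(run_decomposition(subqs, {"q2": ["q1"]}, LOOKUP.__getitem__))
```

Sub-questions with no mutual dependencies could instead be dispatched concurrently. `TopologicalSorter` exposes `prepare()`, `get_ready()`, and `done()` for that style of scheduling.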
The architectural interfaces vary. Some maintain a strict API- or prompt/LLM-driven separation (e.g., D-CoDe has zero architectural modification, using only prompt engineering (Huang et al., 9 Oct 2025)), while others opt for parameter sharing or explicit hard-EM/training objectives tying decomposition and answer generation (Zhu et al., 2023, Guo et al., 2022).
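For the retrieval-augmentation pattern described above, one simple way to merge per-sub-question results is to keep each document once with its best score. The sketch below rests on several assumptions: the toy corpus, the word-overlap `toy_retrieve` function, and max-score merging are illustrative choices made here, and scores are taken to be comparable across queries. Cited systems may rerank or merge quite differently.

```python
from typing import Callable, Dict, List, Optional, Tuple

Retriever = Callable[[str, int], List[Tuple[str, float]]]  # returns (doc_id, score)

# Tiny illustrative corpus (assumption).
CORPUS = {
    "d1": "lyon is a city in france",
    "d2": "paris is the capital of france",
    "d3": "berlin is the capital of germany",
}


def toy_retrieve(query: str, k: int) -> List[Tuple[str, float]]:
    """Hypothetical retriever: scores documents by shared lowercase words."""
    q_words = set(query.lower().replace("?", "").split())
    scored = [
        (doc_id, float(len(q_words & set(text.split()))))
        for doc_id, text in CORPUS.items()
    ]
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


def retrieve_with_subquestions(
    question: str,
    subqs: List[str],
    retrieve: Retriever,
    k: int = 2,
    top_n: Optional[int] = None,
) -> List[str]:
    """Retrieve for the original question and each sub-question, then merge."""
    best: Dict[str, float] = {}
    for query in [question, *subqs]:
        for doc_id, score in retrieve(query, k):
            # Keep each document once, with its highest score across queries.
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_n]]


if __name__ == "__main__":
    q = "What is the capital of the country where Lyon is?"
    subs = ["Which country is Lyon in?", "What is the capital of France?"]
    print([d for d, _ in toy_retrieve(q, 2)])
    print(retrieve_with_subquestions(q, subs, toy_retrieve, k=2))
```

In this toy setup, the document about Lyon is missed when only the original question is used with `k=2`, and it is recovered once the first sub-question is issued as its own query.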
3. Empirical Benefits and Evaluation
Across domains, sub-question decomposition yields significant gains:
- Multi-hop QA (text, knowledge base, video): Explicit decomposition narrows errors in reasoning chains, exposes model shortcuts, and enables more robust multi-step inference. Empirical results on HotpotQA show +6–12 F1 points for decomposed pipelines over one-shot baselines; similar results hold for DROP, ComplexWebQuestions, AGQA, and KBQA tasks (Guo et al., 2022, Zhu et al., 2023, Eyal et al., 2023, Huang et al., 2023, Deng et al., 2022, Gandhi et al., 2022, Ammann et al., 1 Jul 2025).
- Retrieval Coverage and Precision: In RAG, decomposing queries enables higher evidence recall (MultiHop-RAG MRR@10: +36.7%) and answer accuracy (F1: +11.6%) (Ammann et al., 1 Jul 2025). Classification of sub-questions by type refines evaluation—and direct optimization of “core” sub-question coverage increases win-rate by 74% over naive RAG (Xie et al., 2024).
- Numerical Multi-table Reasoning: Table-aligned decomposition in MTQA drives +24% recall and +55% answer gains over leading baselines (Luo et al., 9 Mar 2026).
- Multimodal Reasoning: In video and image + language domains, both deterministic and prompt-driven decomposition mitigates perceptual bottlenecks and token overload, yielding +6.2 points (EgoSchema) (Huang et al., 9 Oct 2025), and improves VQA accuracy by +5–23 points upon targeted finetuning (Zhang et al., 2024, Wang et al., 2022).
- Faithfulness and Interpretability: By forcing explicit sub-question answering, decomposition increases sensitivity to step corruption, truncation, or bias, enhancing “faithfulness” metrics by 10–20 points over Chain-of-Thought or one-shot reasoning (Radhakrishnan et al., 2023).
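Two of the metrics quoted above, MRR@10 and answer F1, are commonly computed along the following lines. This sketch uses the standard textbook definitions with whitespace tokenization. The cited papers may differ in normalization, tokenization, or averaging, so the function names and details here are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Set


def mrr_at_k(ranked_lists: Sequence[List[str]], relevant_sets: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Comparing a decomposed pipeline with a one-shot baseline then amounts to evaluating both with the same metric on the same question set and reporting the difference.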
4. Representational and Structural Variants
Decomposition can yield:
- Flat Sequences: Ordered or unordered lists of sub-questions, suitable for pipeline or batch answering (Guo et al., 2022, Ammann et al., 1 Jul 2025, Huang et al., 9 Oct 2025, Zhu et al., 2023).
- Trees/DAGs: Structures which directly encode dependencies (QDTrees, AMR graphs, compositional DAGs in AGQA), enabling principled flow of information and explicit mapping to logical, programmatic, or KB queries (Huang et al., 2023, Deng et al., 2022, Gandhi et al., 2022).
- Typed or Annotated Collections: Taxonomically classifying sub-questions for downstream metric weighting, selective retrieval, or prioritization (Xie et al., 2024, Chen et al., 2022).
The decomposition can be deterministic (e.g., segmentation of AMR graphs, as in QDAMR), heuristic (hand-written splitting and templating (Min et al., 2019, Tang et al., 2020)), or learned (via LLMs, step-wise prompting, or neural sequence models).
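The three representational variants above can share one data structure. The sketch below is a hypothetical container, not a format used by any cited work: the `SubQuestion` class and its field names are assumptions. A flat list has no dependencies, a tree or DAG fills in `depends_on`, and a typed collection fills in `role`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubQuestion:
    qid: str
    text: str
    role: Optional[str] = None  # e.g. "core", "background", "follow-up"
    depends_on: List[str] = field(default_factory=list)  # ids of prerequisite sub-questions


def is_flat(decomposition: List[SubQuestion]) -> bool:
    """True when no sub-question depends on another (a flat list)."""
    return all(not sq.depends_on for sq in decomposition)


def by_role(decomposition: List[SubQuestion], role: str) -> List[SubQuestion]:
    """Select sub-questions of one type, e.g. for metric weighting."""
    return [sq for sq in decomposition if sq.role == role]


if __name__ == "__main__":
    dag = [
        SubQuestion("q1", "Which country is Lyon in?", role="core"),
        SubQuestion("q2", "What is the capital of {q1}?", role="follow-up", depends_on=["q1"]),
    ]
    print(is_flat(dag), [sq.qid for sq in by_role(dag, "core")])
```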
5. Practical Limitations and Challenges
Despite these empirical gains, sub-question decomposition introduces several costs and open problems:
- Latency and Cost: Each sub-question entails a separate model (often LLM) invocation, increasing inference time and compute by up to 5–6x over the baseline (Huang et al., 9 Oct 2025, Ammann et al., 1 Jul 2025, Xie et al., 2024).
- Over-decomposition and Noise: Generation can yield trivial, redundant, or off-topic sub-questions; noisy decompositions can propagate errors, particularly in open-ended or static queries (Huang et al., 9 Oct 2025, Xie et al., 2024).
- Prompt and Model Sensitivity: Performance and compositional coverage are heavily influenced by prompt design, LLM choice/hallucinations, and the absence of explicit loss functions or coverage objectives in many frameworks (Ammann et al., 1 Jul 2025, Huang et al., 9 Oct 2025).
- Ambiguity in Facet Coverage: Determining when decomposition is necessary, how many sub-questions to generate, and weighting their importance remains challenging. Equal weighting may not reflect user or task priorities (Xie et al., 2024, Zhang et al., 2024).
- Domain Adaptation and Supervision: Gold decompositions (for training or evaluation) are expensive; cross-domain generalization, especially for implied sub-questions or schema-sensitive decompositions, is imperfect without substantial annotated corpora (Chen et al., 2022, Guo et al., 2022, Eyal et al., 2023).
- End-to-end Trainability: Many architectures remain non-differentiable or are not amenable to explicit backpropagation of downstream losses through decomposition—limiting the capacity to jointly optimize sub-question quality relative to final performance (Huang et al., 9 Oct 2025, Zhu et al., 2023, Guo et al., 2022).
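Two of the costs listed above, redundant sub-questions and extra model calls, can be made concrete with a small sketch. The Jaccard threshold, the cap `max_n`, and the one-call-per-stage cost model are assumptions chosen for illustration. None of them comes from the cited papers.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def prune_subquestions(subqs: List[str], threshold: float = 0.8, max_n: int = 4) -> List[str]:
    """Drop near-duplicate sub-questions (token Jaccard >= threshold) and cap the count."""
    kept: List[str] = []
    for cand in subqs:
        cand_tokens = _tokens(cand)
        if not cand_tokens:
            continue  # skip empty or punctuation-only lines
        redundant = any(
            len(cand_tokens & _tokens(prev)) / len(cand_tokens | _tokens(prev)) >= threshold
            for prev in kept
        )
        if not redundant:
            kept.append(cand)
        if len(kept) == max_n:
            break
    return kept


def estimated_calls(n_subqs: int) -> int:
    """Assumed cost model: 1 decomposition call + 1 call per sub-question + 1 aggregation call."""
    return 1 + n_subqs + 1
```

Under this assumed cost model, three to four sub-questions already imply five to six model calls where a one-shot baseline makes one, which is the same order of magnitude as the overhead quoted above.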
6. Theoretical Guarantees and Open Problems
Formally, for any composite task that admits a polynomial-depth, constant-fan-in decomposition, intermediate supervision (i.e., training with sub-task labels) renders otherwise unlearnable problems tractable for standard sequence models, with polynomial-time SGD convergence (Wies et al., 2022). This provides theoretical grounding for the empirical success of chain-of-thought and decomposed reasoning approaches: tasks that are unlearnable end to end (e.g., random parities, deep composition) become learnable when decomposed and supervised at finer granularity.
However, these results rely on full sub-task annotation, teacher forcing at training time, and the existence of low-degree decompositions. Robustness to noisy, incomplete, or model-generated sub-questions at test time, and extension to problems without such efficient decompositions, are active areas of research.
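The parity example mentioned above gives a feel for what "intermediate supervision" means. The toy sketch below is an illustration only, not the construction or proof of Wies et al. It shows that the parity of a bit string decomposes into a chain of XOR steps, each depending on just two inputs (the previous partial result and the next bit), and that the partial results are exactly the sub-task labels a model could be trained on. The function name is an assumption.

```python
from itertools import accumulate
from operator import xor
from typing import List, Tuple


def parity_with_steps(bits: List[int]) -> Tuple[int, List[int]]:
    """Return the parity of `bits` and the running XOR after each bit."""
    # accumulate(..., initial=0) yields 0, b1, b1^b2, ...; needs Python 3.8+.
    partial = list(accumulate(bits, xor, initial=0))
    return partial[-1], partial[1:]


if __name__ == "__main__":
    final, steps = parity_with_steps([1, 0, 1, 1])
    # End-to-end supervision exposes only `final`;
    # intermediate supervision also exposes every entry of `steps`.
    print(final, steps)
```

Each entry of `steps` is a constant-fan-in function of the previous entry and one new bit, which is the structural property the theoretical result relies on.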
7. Domain-Specific Adaptations and Successes
Sub-question decomposition exhibits broad applicability:
- Multimodal QA: In video (D-CoDe, AGQA-Decomp) and multimodal LLMs (Co-VQA, DecoVQA+), structured decomposition mitigates perception bottlenecks and supports variable-length, adaptive reasoning chains (Huang et al., 9 Oct 2025, Gandhi et al., 2022, Zhang et al., 2024, Wang et al., 2022).
- Fact Verification & Evidence Aggregation: Generating explicit, minimal yes/no sub-questions, both literal and implied, enhances diagnosticity, evidence retrieval, and interpretability in complex claim verification (ClaimDecomp) (Chen et al., 2022).
- Program Synthesis and Table QA: Schema-sensitive, operator-based decomposition (QPL in text-to-SQL, DMRAL for multi-table QA) provides pipeline modularity, interpretability, and controllable complexity, with concomitant accuracy and robustness gains (Eyal et al., 2023, Luo et al., 9 Mar 2026).
- Knowledge Base QA: Tree-based or computation-graph-based decompositions (QDT, RL ordering with full compositional trees) enable tractable, explainable multi-hop inference even over large KGs (Huang et al., 2023, Zhang et al., 2019).
In all domains, decomposition is both a tool for engineering more robust models and a critical axis for evaluating compositional generalization, logical faithfulness, and coverage. The future trajectory will likely feature more integrated, learnable, and domain-adaptive decomposition modules, new evaluation metrics favoring compositional transparency, and hybrid architectures balancing explicit structure with the fluency of large generative models.