STRIDE: Strategic Iterative Decision-Making for Retrieval-Augmented Multi-Hop Question Answering

Published 19 Apr 2026 in cs.AI | (2604.17405v1)

Abstract: Multi-hop question answering (MHQA) enables accurate answers to complex queries by retrieving and reasoning over evidence dispersed across multiple documents. Existing MHQA approaches mainly rely on iterative retrieval-augmented generation, which suffers from two major issues. 1) Existing methods prematurely commit to surface-level entities rather than underlying reasoning structures, making question decomposition highly vulnerable to lexical ambiguity. 2) Existing methods overlook the logical dependencies among reasoning steps, resulting in uncoordinated execution. To address these issues, we propose STRIDE, a framework that separates strategic planning, dynamic control, and grounded execution. At its core, a Meta-Planner first constructs an entity-agnostic reasoning skeleton to capture the abstract logic of the query, thereby deferring entity grounding until after the reasoning structure is established, which mitigates disambiguation errors caused by premature lexical commitment. A Supervisor then orchestrates sub-question execution in a dependency-aware manner, enabling efficient parallelization where possible and sequential coordination when necessary. By dynamically deciding whether to retrieve new evidence or infer from existing facts, it avoids redundant queries and error propagation, while fusing cross-branch information and reformulating failed queries to enhance robustness. Grounded fact extraction and logical inference are delegated to specialized execution modules, ensuring faithfulness through explicit separation of retrieval and reasoning. We further propose STRIDE-FT, a modular fine-tuning framework that uses self-generated execution trajectories from STRIDE, requiring neither human annotations nor stronger teacher models. Experiments show that STRIDE achieves robust and accurate reasoning, while STRIDE-FT effectively enhances open-source LLMs.

Authors (5)

Summary

The paper introduces a modular framework that disentangles reasoning into strategy, control, and execution layers for robust multi-hop question answering.
The paper employs dynamic scheduling and self-supervised fine-tuning to enhance retrieval accuracy and execution efficiency.
The paper demonstrates improved performance and resilience across benchmarks, especially under noisy, compositional, and deep reasoning conditions.

Motivation and Problem Setting

Multi-hop question answering (MHQA) requires synthesizing evidence and reasoning over multiple documents, a significant challenge beyond single-hop queries that simple information retrieval can resolve. The dominant paradigm, Retrieval-Augmented Generation (RAG), enables LLMs to ground outputs in external sources. Standard and even iterative RAG pipelines, however, are hindered by two central problems: (1) entity grounding that happens before the reasoning skeleton has been abstracted, causing brittle disambiguation and error cascades, and (2) rigid, sequential sub-question scheduling that neglects the dependency structure among reasoning steps.

Figure 1: Key challenges in current iterative RAG are premature entity grounding (top, leading to cascading errors) and rigid scheduling (bottom, failing to account for dependency-rich sub-questions).

These deficiencies are especially detrimental for complex, compositional queries where successful inference is contingent on both robust high-level strategy and adaptive, context-aware execution control.

STRIDE Framework: Hierarchically Structured Reasoning

STRIDE directly addresses these limitations via explicit decomposition of the MHQA process into three decision-making layers—Strategy, Control, and Execution—mirroring real-world hierarchical decision paradigms.

Figure 2: The STRIDE framework decomposes reasoning into strategy, control, and execution modules, each with a distinct set of responsibilities.

Strategy Layer: Meta-Planner and General Strategy

The Meta-Planner first constructs an abstract, entity-agnostic reasoning skeleton (“General Strategy”), separating logical flow from surface-level entity commitments. Only once the overall logic and dependency paths across entities and relations are set does the system instantiate a “Concrete Plan” of executable, entity-specific sub-questions. This separation improves robustness to lexical ambiguity and enables transferable planning over structurally similar queries.
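The two-stage planning idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Step` class, the `{slot}` template syntax, and the example question are hypothetical and are not taken from the paper. The skeleton fixes the logical flow first; entity names are bound only afterwards, and slots that refer to the output of an earlier step stay unresolved until run time.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One sub-question (hypothetical structure, not the paper's schema)."""
    step_id: str
    template: str                      # may contain {slot} placeholders
    depends_on: list = field(default_factory=list)


def instantiate(skeleton, bindings):
    """Bind question-level slots; leave step-output slots such as {s1} untouched."""
    plan = []
    for step in skeleton:
        text = step.template
        for slot, value in bindings.items():
            text = text.replace("{" + slot + "}", value)
        plan.append(Step(step.step_id, text, list(step.depends_on)))
    return plan


# Entity-agnostic skeleton for a two-hop "attribute of a related entity" question.
skeleton = [
    Step("s1", "Who is the {relation_1} of {entity}?"),
    Step("s2", "What is the {relation_2} of {s1}?", depends_on=["s1"]),
]

# Grounding happens only after the logical structure is fixed.
concrete = instantiate(
    skeleton,
    {"relation_1": "author", "entity": "Dune", "relation_2": "birthplace"},
)
for step in concrete:
    print(step.step_id, "|", step.template, "|", step.depends_on)
```

Because the skeleton carries no entity names, the same two-step structure could be reused for any question with the same shape by supplying different bindings.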

Control Layer: Supervisor and Adaptive Scheduling

The Supervisor dynamically orchestrates sub-question execution according to their logical dependencies. By maintaining the evolving execution state, it enables:

Parallel execution of independent branches
Sequential or fork-join coordination for dependent sub-questions
Adaptive choice between evidence retrieval and inference-only reasoning
On-the-fly query rewrites when retrieval fails or disambiguation is required
Cross-branch information fusion and robust fallback behaviors for execution failures
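Dependency-aware scheduling of the kind listed above can be sketched as grouping sub-questions into "waves". This is a generic level-by-level topological ordering, offered as an assumption about how such a scheduler might look and not as the Supervisor's actual algorithm; the function name `schedule_waves` and the step identifiers are hypothetical. Steps in the same wave have no unmet prerequisites and could run in parallel, while later waves wait for earlier ones (a fork-join pattern).

```python
def schedule_waves(dependencies):
    """Group steps into waves; each step's prerequisites lie in earlier waves."""
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A step is ready once all of its prerequisites have completed.
        ready = sorted(step for step, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves


# q1 and q2 are independent; q3 joins their results; q4 follows q3.
deps = {"q1": [], "q2": [], "q3": ["q1", "q2"], "q4": ["q3"]}
print(schedule_waves(deps))  # [['q1', 'q2'], ['q3'], ['q4']]
```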

Execution Layer: Extractor and Reasoner

Execution is further modularized via specialized LLM-based units:

Extractor for atomic fact grounding from retrieved documents
Reasoner for logical synthesis over current fact sets

This decomposition ensures interpretability, faithfulness, and more structured supervision for downstream fine-tuning.
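How the control and execution layers might interact for a single sub-question can be sketched as follows. This is a toy illustration under stated assumptions: `run_step`, the four callables, the rewrite budget, and the tiny in-memory corpus are all hypothetical stand-ins, whereas in STRIDE these roles are played by LLM-based modules and a real retriever. The sketch first attempts inference from facts already known, retrieves only if that fails, passes documents through an extractor before any reasoning, and reformulates the query when retrieval yields nothing usable.

```python
def run_step(question, facts, retrieve, extract, reason, rewrite, max_rewrites=2):
    """Answer one sub-question, retrieving only when known facts are insufficient."""
    answer = reason(question, facts)           # inference-only attempt
    if answer is not None:
        return answer, "inferred"
    query = question
    for _ in range(max_rewrites + 1):
        docs = retrieve(query)
        new_facts = extract(question, docs)    # keep grounded facts only
        if new_facts:
            facts.extend(new_facts)
            answer = reason(question, facts)
            if answer is not None:
                return answer, "retrieved"
        query = rewrite(query)                 # reformulate and try again
    return None, "failed"


# Toy stand-ins for the retriever, Extractor, Reasoner, and query rewriter.
CORPUS = {"author of Dune": ["Dune was written by Frank Herbert."]}

def retrieve(query):
    return CORPUS.get(query, [])

def extract(question, docs):
    return [doc for doc in docs if "Dune" in doc]

def reason(question, facts):
    for fact in facts:
        if "written by" in fact:
            return fact.split("written by ")[1].rstrip(".")
    return None

def rewrite(query):
    return query.replace("Who is the ", "").rstrip("?")


facts = []
first = run_step("Who is the author of Dune?", facts, retrieve, extract, reason, rewrite)
second = run_step("Who is the author of Dune?", facts, retrieve, extract, reason, rewrite)
print(first, second)
```

In this toy run the first call needs one query rewrite before retrieval succeeds, and the second call is answered from the stored fact without retrieving again.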

Modular Self-Supervised Fine-Tuning (STRIDE-FT)

To improve open-source LLM deployment within STRIDE, STRIDE-FT introduces modular self-supervised fine-tuning using execution traces from STRIDE itself, thus requiring neither human annotation nor teacher models. Components are fine-tuned for plan preference (Meta-Planner), effective rewrites (Supervisor), minimal necessary fact extraction (Extractor), and concise, deterministic answer generation (Reasoner), using trajectory-level success signals and outcome filtering.
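Outcome filtering over self-generated trajectories can be sketched as below. The trajectory format (a `success` flag plus per-module `records`) and the function name `build_module_datasets` are assumptions made for illustration; the summary does not specify the paper's data schema or its exact filtering criteria. The sketch keeps only trajectories whose final outcome was successful and splits their records into one training set per module.

```python
from collections import defaultdict


def build_module_datasets(trajectories):
    """Keep records from successful trajectories, grouped by the module that produced them."""
    datasets = defaultdict(list)
    for trajectory in trajectories:
        if not trajectory["success"]:          # trajectory-level outcome filter
            continue
        for record in trajectory["records"]:
            datasets[record["module"]].append(
                {"input": record["input"], "output": record["output"]}
            )
    return dict(datasets)


# Hypothetical trajectories: one successful, one failed.
trajectories = [
    {
        "success": True,
        "records": [
            {"module": "extractor", "input": "doc A", "output": "fact A"},
            {"module": "reasoner", "input": "fact A", "output": "answer A"},
        ],
    },
    {
        "success": False,
        "records": [
            {"module": "reasoner", "input": "fact B", "output": "wrong answer"},
        ],
    },
]

datasets = build_module_datasets(trajectories)
print({module: len(examples) for module, examples in datasets.items()})
```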

Empirical Evaluation

Experiments span three MHQA benchmarks (2WikiMultihopQA, HotpotQA, MuSiQue) and both open- and closed-source LLMs (Qwen3-8B, GPT-4o-mini).

Key empirical findings:

STRIDE outperforms baselines across all datasets and metrics, with particularly strong gains for highly compositional MuSiQue queries.
Separation of strategy, control, and execution leads to marked performance improvements over monolithic or flat iterative RAG architectures (Table 1).
STRIDE-FT enables open-source models (Qwen3-8B) to close the performance gap with closed-source models, or even surpass them, demonstrating the efficacy of self-supervised modular adaptation.

Figure 3: F1 scores show that abstract (Meta) planning generally outperforms direct, entity-centric planning, especially in retrieval-augmented settings for multi-hop QA.
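For readers unfamiliar with the metric, answer-level F1 in multi-hop QA benchmarks is commonly computed over overlapping tokens between the predicted and gold answers. The sketch below shows that common token-level variant; the summary does not state which exact variant or normalization the paper uses, so the normalization steps here (lowercasing, stripping punctuation and English articles) are an assumption.

```python
from collections import Counter
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Frank Herbert", "Frank Herbert"))
print(token_f1("the author Frank Herbert", "Frank Herbert"))
print(token_f1("Paris", "London"))
```

A partially correct answer therefore earns partial credit: in the second call, two of three predicted tokens match and both gold tokens are covered.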

Furthermore, STRIDE shows the smallest performance degradation when the corpus is expanded to 50,000 documents, indicating robustness in high-noise, open-domain conditions.

Figure 4: STRIDE maintains superior performance under large, noisy retrieval corpora, showing the smallest performance drop among evaluated methods.

It also exhibits strong gains at increasing reasoning depths (number of hops), confirming its ability to manage complex, long-horizon inference chains.

Figure 5: STRIDE consistently outperforms baselines at all reasoning depths (hops) in MuSiQue, confirming robustness in complex multi-hop settings.

Ablation studies indicate:

All core modules are critical; the Supervisor’s dynamic scheduling yields the largest independent contribution.
Modularity in fine-tuning provides strictly additive improvement, with the Reasoner benefiting most from component-level adaptation.

STRIDE is also efficient: it achieves higher F1 at lower average token usage and wall-clock inference time than strong baselines, owing to precise and non-redundant retrieval-generation cycles.

Figure 6: (a) STRIDE achieves the lowest failure rate; (b) STRIDE and especially STRIDE-FT require fewer average iterative refinements per instance, indicating higher execution efficiency and reliability.

Finally, STRIDE's structured, focused sub-questions attain better retrieval yields at smaller top-$k$ settings compared to baselines.

Figure 7: STRIDE achieves superior performance with lower retrieval top-$k$ values, reflecting more targeted and effective sub-question generation.

Theoretical and Practical Implications

STRIDE’s explicit separation of meta-level reasoning, dependency-aware scheduling, and faithfulness-preserving execution introduces new modularization principles for MHQA and broader retrieval-augmented NLP pipelines. This approach not only enables greater robustness and efficiency but also produces naturally interpretable execution traces, supporting both model auditing and targeted self-improvement. Practically, the self-supervised fine-tuning paradigm makes high-performance multi-hop QA accessible to open-source models, facilitating deployment under real-world cost and privacy constraints.

Future Prospects

Future directions include:

Extending STRIDE’s planning and control paradigm to other knowledge-intensive reasoning domains (e.g., multi-modal QA, procedural reasoning).
Leveraging STRIDE-style modular traces for continual self-improvement and process-level explainability in large-scale AI agents.
Exploring automatic plan correctness verification and more sophisticated fallback or recovery mechanisms in extremely noisy or adversarial retrieval settings.

Conclusion

STRIDE provides a principled, hierarchical framework for multi-hop question answering, demonstrating that explicit separation of strategy, control, and execution modules combined with self-supervised modular fine-tuning can drive both accuracy and robustness in retrieval-augmented reasoning tasks. The approach is broadly applicable to complex QA and reasoning settings, especially as open-source LLM capability and structure-aware training paradigms continue to advance.