Two-Stage Reasoning Pipeline

Updated 3 July 2026

A two-stage reasoning pipeline is an architectural design that splits a task into an upstream evidence-extraction stage and a downstream inference stage.
The upstream stage isolates perception, search, or query expansion to produce a refined intermediate representation that improves task focus.
The downstream stage leverages this intermediate product for rule application, verification, and decision making, boosting accuracy and interpretability.

A two-stage reasoning pipeline is an architectural pattern in which a task is decomposed into two ordered stages with different functional roles: an upstream stage that extracts, searches, restructures, or compresses task-relevant evidence, and a downstream stage that reasons over that intermediate product to produce a final prediction, ranking, action, or explanation. Across recent work, this decomposition appears in abstract visual reasoning, reasoning-intensive retrieval, multimodal reasoning, temporal knowledge-graph extrapolation, penetration testing, search relevance, and multimodal moderation (Wang et al., 24 Dec 2025). Taken together, these papers suggest that the term denotes not a single canonical algorithm, but a family of designs that separate evidence acquisition or reasoning initialization from later inference, verification, or policy refinement (Li et al., 2021).

1. Canonical decomposition and intermediate representations

In the cited literature, the defining property of a two-stage reasoning pipeline is the presence of an explicit interface between stages. Stage 1 does not directly solve the full task; instead, it produces an intermediate representation that Stage 2 consumes. The intermediate object may be a natural-language description, localized visual evidence, a clue path, an expanded query, a deformed canonical reference, or a reasoning-augmented training signal. This makes the decomposition operational rather than merely conceptual.

System	Stage 1	Stage 2
DIVER	Iterative query expansion	Retrieval and reranking
P2R	Perceiver localizes evidence	Reasoner answers from annotated image and crops
CluSTeR	RL clue searching	Temporal reasoning over clue-induced graphs
Pentest-R1	Offline RL on walkthroughs	Online RL in a CTF environment
Q-Ponder	Cold-start initialization	GRPO fine-tuning
DR$^2$Seg	Multimodal reasoning to description	Referring segmentation verification

The intermediate representation is central to how these systems factor reasoning. In the ARC-style benchmark analysis, each image is first converted independently into a natural-language description, yielding an enriched task $\widetilde{T}$, after which a separate model $h:\widetilde{\mathcal{T}}\rightarrow\mathcal{Y}$ induces and applies the rule (Wang et al., 24 Dec 2025). In Perceive-to-Reason, the Perceiver predicts boxes

$\mathcal{B} \sim \pi_p(\cdot \mid I,\tilde{Q}_p;\theta),$

and the Reasoner then answers from the annotated image and crops

$Y \sim \pi_r(\cdot \mid I_a, I_c, Q;\theta)$

(Li et al., 1 Jul 2026). In CluSTeR, the first stage returns clue paths from history, and the second stage reasons over a sequence of clue-induced graphs to score future facts (Li et al., 2021). This suggests that the boundary between stages is usually a deliberately engineered semantic bottleneck.
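The explicit interface between stages can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the `TwoStagePipeline` wrapper, the toy grid "images", and both stage functions are hypothetical names introduced only to show the shape of the decomposition, with Stage 1 describing each input independently and Stage 2 seeing only the descriptions.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

X = TypeVar("X")  # raw task input
Z = TypeVar("Z")  # intermediate representation crossing the stage boundary
Y = TypeVar("Y")  # final output


@dataclass(frozen=True)
class TwoStagePipeline(Generic[X, Z, Y]):
    """Hypothetical wrapper: stage1 builds the intermediate, stage2 reasons over it."""

    stage1: Callable[[X], Z]
    stage2: Callable[[Z], Y]

    def run(self, x: X) -> tuple[Z, Y]:
        z = self.stage1(x)  # evidence extraction only; no final answer here
        y = self.stage2(z)  # stage 2 never sees the raw input x
        return z, y         # returning z keeps the interface inspectable


# Toy stand-in for per-image description: each grid is described in isolation.
def describe_each(grids: list[list[list[int]]]) -> list[str]:
    return [f"{len(g)}x{len(g[0])} grid, {sum(map(sum, g))} filled cells" for g in grids]


# Toy stand-in for rule induction: do all descriptions agree?
def all_same_description(descriptions: list[str]) -> bool:
    return len(set(descriptions)) == 1


pipeline = TwoStagePipeline(stage1=describe_each, stage2=all_same_description)
descriptions, answer = pipeline.run([[[1, 0], [0, 1]], [[0, 1], [1, 0]]])
```

Returning the intermediate alongside the answer is a design choice of this sketch; it makes the stage boundary something that can be logged, inspected, or swapped out independently of either stage.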

2. Upstream stages: perception, search, expansion, and compression

A recurrent upstream role is to make the downstream reasoning problem better posed. In visual reasoning, this often means separating perception from inference. The ARC-style two-stage pipeline is explicitly designed so that each image is processed in isolation during perception, preventing leakage of cross-image inductive signals and isolating reasoning from perceptual bottlenecks (Wang et al., 24 Dec 2025). In the same-model setting, this separation improved Mini-ARC from 8.05% to 20.13%, Bongard-LOGO from 62.00% to 73.00%, and ACRE from 22.00% to 34.50%, while manual inspection attributed approximately 80 percent of failures to perception errors (Wang et al., 24 Dec 2025).

In fine-grained visual reasoning, Stage 1 frequently performs localization rather than full inference. P2R treats the major bottleneck as evidence finding: for Qwen3-VL-4B on V-Star, direct chain-of-thought achieved 81.7% overall, whereas providing an oracle hint with ground-truth boxes and crops boosted performance to 90.6% (Li et al., 1 Jul 2026). The paper’s interpretation is that localizing the right evidence is itself a reasoning prerequisite. A similar semantic compression appears in DR$^2$Seg, where the first rollout transforms a complex query into a self-contained description $\mathcal{D}$ before any second-pass verification (He et al., 15 Jan 2026).

In retrieval systems, Stage 1 often takes the form of query reasoning. DIVER’s upstream phase combines document preprocessing with iterative query expansion. The paper states that DIVER-QExpand performs two rounds of retrieval and expansion, retrieving the top-5 documents in each round, excluding previously seen documents in later rounds, and using QWEN-R1-Distill-14B with temperature 0.7 (Long et al., 11 Aug 2025). The goal is not mere query rewriting, but propagation of reasoning intent into retrieval space. In temporal knowledge graphs, CluSTeR’s clue-searching stage instead performs a time-constrained RL beam search over historical paths, reducing a large historical graph to a compact set of candidate clues (Li et al., 2021).
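The retrieve-and-expand loop described for DIVER-QExpand can be sketched as follows. This is an illustrative outline under stated assumptions, not DIVER's implementation: `expand_query`, `toy_retrieve`, `toy_rewrite`, and the three-document corpus are hypothetical, the retriever is a token-overlap stand-in, and the rewrite step simply concatenates text where the real system would call an LLM. Only the round count, the per-round top-k, and the exclusion of previously seen documents come from the description above.

```python
from typing import Callable

# Hypothetical toy corpus; real systems would query a dense or sparse index.
CORPUS = {
    "d1": "beam search over paths",
    "d2": "query expansion with feedback",
    "d3": "reranking candidate documents",
}


def toy_retrieve(query: str, top_k: int, exclude: set[str]) -> list[str]:
    """Rank unseen documents by token overlap with the query (stand-in retriever)."""
    q = set(query.lower().split())
    scored = [
        (len(q & set(text.split())), doc_id)
        for doc_id, text in CORPUS.items()
        if doc_id not in exclude
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def toy_rewrite(query: str, doc_ids: list[str]) -> str:
    """Stand-in for the LLM expansion step: append retrieved text to the query."""
    return query + " " + " ".join(CORPUS[d] for d in doc_ids)


def expand_query(
    query: str,
    retrieve: Callable[[str, int, set[str]], list[str]],
    rewrite: Callable[[str, list[str]], str],
    rounds: int = 2,
    top_k: int = 5,
) -> tuple[str, list[str]]:
    seen: set[str] = set()
    used: list[str] = []
    current = query
    for _ in range(rounds):
        docs = retrieve(current, top_k, seen)  # later rounds skip seen documents
        seen.update(docs)
        used.extend(docs)
        current = rewrite(current, docs)       # expanded query feeds the next round
    return current, used


expanded, used_docs = expand_query("query expansion", toy_retrieve, toy_rewrite, top_k=2)
```

The exclusion set is what forces the second round to surface new evidence rather than re-reading the documents that shaped the first expansion.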

Across these examples, Stage 1 does not simply pre-process input. It selects or constructs the state on which later reasoning will operate. A plausible implication is that the upstream stage is most valuable when raw input entangles irrelevant variability with the actual inferential substrate.

3. Downstream stages: inference, verification, reranking, and decision making

The downstream stage typically performs the task that is most recognizably “reasoning” in the narrow sense: rule induction, answer generation, ranking, control, or verification. In the ARC-style pipeline, the second stage induces and applies rules over textualized descriptions rather than raw visual input (Wang et al., 24 Dec 2025). In P2R, the Reasoner uses both global context from the annotated image and local detail from crops; the paper reports that P2R-4B reaches 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K (Li et al., 1 Jul 2026).

For reasoning-intensive retrieval, DIVER’s downstream stage is itself layered. After candidate generation, the top-100 documents are processed by a pointwise reranker using Qwen-2.5-32B-Instruct and a listwise reranker using a larger reasoning LLM such as DeepSeek-R1-0528 (Long et al., 11 Aug 2025). The pointwise score is interpolated with retrieval score as

$S_{\text{pointwise}} = 0.6\cdot S_{\text{reranker}} + 0.4\cdot S_{\text{retriever}},$

while listwise reranking adds global consistency (Long et al., 11 Aug 2025). The paper’s broader claim is that the system should be read not as ordinary retrieve-then-rerank, but as reason about the query, retrieve with a reasoning-aware model, then reason again over the candidates.
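The interpolation above is simple enough to state directly in code. The sketch below assumes that reranker and retriever scores are already on a comparable scale, which the text does not specify; the function names and the example scores are hypothetical.

```python
def pointwise_score(reranker: float, retriever: float, weight: float = 0.6) -> float:
    """Interpolate reranker and retriever scores with weights 0.6 and 0.4."""
    return weight * reranker + (1.0 - weight) * retriever


def rank_candidates(
    doc_ids: list[str],
    reranker_scores: list[float],
    retriever_scores: list[float],
) -> tuple[list[str], dict[str, float]]:
    """Return candidates sorted by fused score, plus the fused scores themselves."""
    fused = {
        doc_id: pointwise_score(rr, rt)
        for doc_id, rr, rt in zip(doc_ids, reranker_scores, retriever_scores)
    }
    order = sorted(doc_ids, key=lambda d: fused[d], reverse=True)
    return order, fused


# Illustrative scores only: "a" wins on retrieval, "b" wins on reranking.
order, fused = rank_candidates(["a", "b", "c"], [0.2, 0.9, 0.5], [0.8, 0.3, 0.6])
```

In this toy case the fused ranking follows the reranker's preference while the retrieval score still separates the remaining candidates, which is the intended effect of weighting the reranker more heavily without discarding the first-stage signal.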

Verification is another common Stage-2 role. DR$^2$Seg’s second rollout replaces the original query with the generated description and checks whether the model can still localize correctly from $(\mathcal{I},\mathcal{D})$ alone (He et al., 15 Jan 2026). This gives the first-stage description an explicit self-containment test. In multimodal moderation, KidsNanny routes object labels as text, not raw pixels, into a second-stage OCR plus 7B LLM that performs contextual confirmation; the full Stage 1+2 pipeline reaches 81.40% accuracy and 86.16% F1 at 120 ms (Panchal et al., 17 Mar 2026). In search relevance distillation, the deployed BERT student ultimately runs only the standard query-service input, but its training-stage “teacher mode” uses reasoning-augmented input to reshape the student representation (Xia et al., 13 Oct 2025).
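The self-containment test in the segmentation case can be sketched in a few lines. The following is a simplified illustration rather than the paper's procedure: it uses bounding boxes instead of segmentation masks, and the IoU threshold of 0.5, the `localize` callable, and the toy lookup are all assumptions made for the example. The one property it is meant to show is that the second pass receives the image and the description but not the original query.

```python
from typing import Callable

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def description_is_self_contained(
    localize: Callable[[str, str], Box],
    image: str,
    description: str,
    target: Box,
    threshold: float = 0.5,  # assumed cutoff, not from the source
) -> bool:
    predicted = localize(image, description)  # original query deliberately withheld
    return box_iou(predicted, target) >= threshold


# Hypothetical localizer: only a specific description resolves to the target.
def toy_localize(image: str, description: str) -> Box:
    if description == "the red mug left of the laptop":
        return (0.0, 0.0, 2.0, 2.0)
    return (5.0, 5.0, 6.0, 6.0)


target_box: Box = (0.0, 0.0, 2.0, 2.0)
good = description_is_self_contained(
    toy_localize, "img.png", "the red mug left of the laptop", target_box
)
vague = description_is_self_contained(toy_localize, "img.png", "it", target_box)
```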

These systems indicate that Stage 2 is not always the more computationally expensive half. Sometimes it is the more constrained half: it reasons over a representation that has already been filtered, localized, or semantically normalized.

4. Training-oriented two-stage pipelines

A distinct but related use of the two-stage pattern appears in training, where Stage 1 builds a reasoning prior and Stage 2 refines or transfers it. Pentest-R1 is explicit on this point. Stage 1 performs offline RL on more than 500 real-world penetration testing walkthroughs, which yield 14K multi-turn Thought-Command-Observation tuples, and rewards both structure and command fidelity.

Stage 2 then applies online RL in InterCode-CTF with stepwise rewards for valid actions, flag capture, and failures (Kong et al., 10 Aug 2025). The ablation shows why the two stages are treated as complementary: the base model achieves 3.0% on AutoPenBench, Stage 1 GRPO alone and Stage 2 GRPO alone both reach 9.1%, but the full pipeline reaches 24.2%; on Cybench it reaches 15.0% unguided (Kong et al., 10 Aug 2025).

LMM-R1 uses an analogous training logic in multimodal reasoning. Its first stage, Foundational Reasoning Enhancement, applies rule-based PPO to verifiable text-only data, while the second stage, Multimodal Generalization Training, continues rule-based RL on multimodal or agent tasks (Peng et al., 10 Mar 2025). The reward combines a format term and an accuracy term, and the reported gains are 4.5% on text-only benchmarks, 4.83% on multimodal benchmarks, and 3.63% on the Football Game task (Peng et al., 10 Mar 2025). The central claim is that text-based reasoning enhancement transfers better than direct multimodal RL when the latter would otherwise overemphasize perceptual shortcuts.
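To make the idea of a rule-based reward that combines format and accuracy concrete, here is a minimal sketch. It should not be read as LMM-R1's reward: the `<think>`/`<answer>` tag layout, the exact-match accuracy check, and the mixing weight `alpha = 0.5` are all assumptions chosen for illustration.

```python
import re

# Assumed response layout: reasoning in <think> tags, final answer in <answer> tags.
RESPONSE_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(response: str, reference: str, alpha: float = 0.5) -> float:
    """Weighted sum of a format check and an exact-match accuracy check."""
    match = RESPONSE_PATTERN.match(response.strip())
    format_ok = match is not None
    answer = match.group(2).strip() if match else ""
    # Accuracy is only credited when the answer can be parsed from a valid format.
    accuracy = 1.0 if format_ok and answer == reference.strip() else 0.0
    return alpha * float(format_ok) + (1.0 - alpha) * accuracy


correct = rule_based_reward("<think>2 + 2 is 4</think> <answer>4</answer>", "4")
wrong = rule_based_reward("<think>2 + 2 is 5</think> <answer>5</answer>", "4")
unformatted = rule_based_reward("The answer is 4", "4")
```

Because both terms are computed by deterministic rules, a reward of this kind needs no learned reward model, which is what makes verifiable text-only data a convenient starting point for the first training stage.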

Other papers use SFT-first pipelines. Gazal-R1 first performs SFT on 107,033 synthetic medical reasoning examples plus 32,682 MedReason examples, using DoRA and rsLoRA, and then applies GRPO with a composite reward for accuracy, format, cosine length, and repetition control (Adly et al., 18 Jun 2025). Q-Ponder first distills expert-style quality reasoning from Qwen2.5VL-72B, filters the refined dataset, and then uses GRPO to optimize scoring accuracy and reasoning consistency (Cai et al., 3 Jun 2025).

A related training decomposition appears in System-1.5 Reasoning. Its first stage distills natural-language CoT into latent-space continuous thought; its second stage distills full-path latent System-2 reasoning into adaptive shortcut paths with depth and step shortcuts (Wang et al., 25 May 2025). The paper reports CoT-comparable GSM8K performance with over 20x speedup and a 92.31% average reduction in token generation (Wang et al., 25 May 2025). This suggests that, in training settings, two-stage pipelines often separate capability acquisition from efficiency shaping.

5. Efficiency, interpretability, and operational advantages

The two-stage form is often justified not only by accuracy but by compute allocation. KidsNanny uses a fast ViT-plus-detector front end for visual screening in 11.7 ms and invokes OCR plus a text-based 7B reasoner only when needed, rather than paying the cost of a full VLM on every image (Panchal et al., 17 Mar 2026). In search relevance, the teacher LLM is never deployed online; instead, its reasoning is compressed into a 6-layer BERT via Contrastive Reasoning Self-Distillation, and the full CRSD model retains about 98.6% of teacher performance while using only standard input at inference (Xia et al., 13 Oct 2025). In DR$^2$Seg, the second rollout exists only during training; inference uses only the first pass, so the verification stage improves training without increasing test-time cost (He et al., 15 Jan 2026).
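The compute-allocation argument can be illustrated with a gated cascade in which the expensive stage runs only on uncertain inputs. The sketch below is a generic pattern, not KidsNanny's actual routing logic: the `low` and `high` thresholds, the `Verdict` type, the dictionary-based "images", and the toy confirmation rule are hypothetical. What it shares with the description above is that the second stage receives labels and extracted text rather than pixels.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    unsafe: bool
    stage: int  # which stage produced the decision


def moderate(
    image: dict,
    screen: Callable[[dict], tuple[float, list[str]]],
    extract_text: Callable[[dict], str],
    confirm: Callable[[list[str], str], bool],
    low: float = 0.2,   # assumed thresholds, not from the source
    high: float = 0.8,
) -> Verdict:
    risk, labels = screen(image)  # cheap visual front end
    if risk <= low:
        return Verdict(unsafe=False, stage=1)
    if risk >= high:
        return Verdict(unsafe=True, stage=1)
    # Uncertain band: pass labels and OCR text (not pixels) to the text reasoner.
    return Verdict(unsafe=confirm(labels, extract_text(image)), stage=2)


stage2_calls = {"count": 0}


def toy_screen(image: dict) -> tuple[float, list[str]]:
    return image["risk"], image["labels"]


def toy_ocr(image: dict) -> str:
    return image["text"]


def toy_confirm(labels: list[str], text: str) -> bool:
    stage2_calls["count"] += 1  # track how often the expensive stage runs
    return "knife" in labels and "recipe" not in text.lower()


clear = moderate({"risk": 0.05, "labels": ["cat"], "text": ""}, toy_screen, toy_ocr, toy_confirm)
borderline = moderate(
    {"risk": 0.5, "labels": ["knife"], "text": "Easy recipe: chop onions"},
    toy_screen, toy_ocr, toy_confirm,
)
```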

Interpretability is another recurrent advantage. CluSTeR’s clue paths expose the historical evidence used for temporal extrapolation (Li et al., 2021). Q-Ponder’s cold-start stage teaches a structured low-level and high-level quality analysis before a final score (Cai et al., 3 Jun 2025). Gazal-R1 trains explicit clinical reasoning followed by a final assessment (Adly et al., 18 Jun 2025). In reasoning segmentation, the generated description $\mathcal{D}$ acts as a compact semantic explanation of what object is being localized (He et al., 15 Jan 2026).

At the same time, efficiency gains are not equivalent to simply shortening outputs. DR$^2$Seg’s ablation shows that adding the description reward and length reward reduces reasoning from 81.5 tokens to 26.9 tokens while improving from 64.9 gIoU and 58.8 cIoU to 68.5 gIoU and 65.8 cIoU (He et al., 15 Jan 2026). This suggests that in many two-stage systems, brevity is beneficial only when coupled to a stage boundary that preserves the essential task state.
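As a quick check of the magnitudes in this ablation, using only the figures quoted above:

$\frac{81.5 - 26.9}{81.5} \approx 0.670, \qquad 68.5 - 64.9 = 3.6\ \text{gIoU}, \qquad 65.8 - 58.8 = 7.0\ \text{cIoU}.$

That is, reasoning length falls by roughly 67% while both segmentation metrics rise, so the shorter output is not bought at the cost of accuracy.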

6. Misconceptions, failure modes, and evaluation issues

A common misconception is that two-stage reasoning pipelines merely add extra modules to an otherwise unchanged task. Several papers argue the opposite: the decomposition changes what is being measured or optimized. The ARC-style study explicitly contends that current benchmarks conflate perception and reasoning, and that much of the observed human–AI gap comes from perception rather than abstract reasoning (Wang et al., 24 Dec 2025). DIVER similarly argues that reasoning-intensive retrieval cannot be reduced to topical matching; synthetic data, hard negatives, and reranking are all needed to propagate reasoning across stages (Long et al., 11 Aug 2025).

Another misconception is that more intermediate reasoning is always better. DR$^2$Seg is motivated by overthinking, and its self-reward design penalizes redundant reasoning once accurate localization begins (He et al., 15 Jan 2026). System-1.5 Reasoning argues that not all steps deserve equal compute, and its shortcut distillation is explicitly designed to skip trivial steps (Wang et al., 25 May 2025). A closely related observation appears in StreamMA, which is not a standard two-stage architecture but a pipelined multi-agent protocol: it reports that early reasoning steps are more reliable than later ones, so passing the full chain can sometimes mislead downstream agents (Yang et al., 3 Jun 2026). The broader implication is that stage boundaries are useful partly because they block the uncontrolled accumulation of low-value intermediate computation.

Failure modes are also stage-specific. Gazal-R1 documents reward hacking, training instability, and false positive verification in MCQ-based RL, emphasizing the tension between factual recall and detailed reasoning (Adly et al., 18 Jun 2025). KidsNanny reports promising text-only recall but explicitly warns that the text-only subset contains only 44 images, that significance testing was not performed, and that generalization beyond UnsafeBench Sexual remains unproven (Panchal et al., 17 Mar 2026). Search relevance distillation finds that removing the reasoning path or replacing it with random reasoning degrades performance, indicating that the benefit is tied to semantically meaningful intermediate supervision rather than auxiliary text per se (Xia et al., 13 Oct 2025).

Overall, the literature portrays the two-stage reasoning pipeline as a disciplined way to factor tasks whose raw formulation entangles perception, search, reasoning, control, or evaluation. The central design question is not simply where to split the system, but what intermediate representation makes the second stage both easier and more faithful to the intended task.