DecEx-RAG: Process-Supervised RAG Framework
- DecEx-RAG is a framework that models retrieval-augmented generation as an MDP, decomposing tasks into sequential decision-making stages.
- It introduces fine-grained, step-level rewards and a dynamic pruning strategy to focus on high-reward reasoning paths and improve data efficiency.
- The two-stage training process, combining SFT and DPO, yields robust performance improvements across open-domain QA tasks with a nearly 6× gain in data construction efficiency.
DecEx-RAG is a process-supervised framework for agentic Retrieval-Augmented Generation (RAG) that models autonomous task decomposition, dynamic retrieval, and high-quality answer generation as a Markov Decision Process (MDP). Unlike traditional outcome-supervised RAG methods, DecEx-RAG introduces fine-grained process supervision and an efficient pruning strategy to enhance exploration, data expansion, and overall system effectiveness in LLMs (Leng et al., 7 Oct 2025).
1. Markov Decision Process Formulation
DecEx-RAG conceptualizes the RAG pipeline as an MDP composed of sequential decision-making and execution stages. The system state is represented by the original query together with the history of sub-question/response pairs accumulated so far, $s_t = (q, \mathcal{H}_t)$ with $\mathcal{H}_t = \{(q_1, r_1), \dots, (q_{t-1}, r_{t-1})\}$. Each action $a_t$ includes
- A termination signal ($\tau_t$), determining whether the current reasoning branch should halt or continue.
- A retrieval/subdivision decision ($d_t$), specifying whether and how the agent should invoke external retrieval or decompose the task further.
The system transitions iteratively: At each time step, the agent makes a local decision and executes a substep (retrieval/action), receives immediate feedback, and updates the state. The process proceeds until task termination is invoked.
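This decision/execution loop can be sketched as follows. The sketch is illustrative only, assuming hypothetical helper interfaces (`policy.decide`, `policy.execute`, `policy.final_answer`, `retriever`) that stand in for the policy LLM and the search tool; it is not the released implementation.

```python
# A minimal sketch of the MDP-style decision/execution loop: at each step the policy emits
# an action containing a termination signal and a retrieval/subdivision decision, a substep
# is executed, and the state (query + history) is updated. All helpers are assumed.

def run_episode(query, policy, retriever, max_steps=8):
    history = []  # accumulated (sub_question, response) pairs; with `query`, this is the state
    for _ in range(max_steps):
        state = {"query": query, "history": history}
        action = policy.decide(state)            # {"terminate": bool, "retrieve": bool, "sub_question": str}
        if action["terminate"]:                  # termination signal
            break
        evidence = retriever(action["sub_question"]) if action["retrieve"] else None
        response = policy.execute(action["sub_question"], evidence)   # sub-execution step
        history.append((action["sub_question"], response))            # state update
    return policy.final_answer(query, history)
```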
The reward for a given state-action pair $(s_t, a_t)$ is computed over a batch of $N$ rollouts (each a simulated full reasoning trajectory):
$$R(s_t, a_t) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}(\hat{y}_i, y^{*}),$$
where $\mathcal{E}$ is an evaluation function (e.g., Exact Match or F1) scoring the correctness of the terminal answer $\hat{y}_i$ generated by the $i$-th rollout against the gold answer $y^{*}$.
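A minimal sketch of this rollout-based step reward, assuming a hypothetical `rollout` helper that completes a full trajectory from the current step and using Exact Match as the evaluation function:

```python
# Hedged sketch of the step-level reward: the value of (s_t, a_t) is the mean terminal-answer
# score over N simulated rollouts. `rollout` is an assumed helper, not the paper's API.

def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.strip().lower())

def step_reward(state, action, gold_answer, rollout, num_rollouts: int = 8) -> float:
    scores = [
        exact_match(rollout(state, action), gold_answer)  # score the terminal answer of each rollout
        for _ in range(num_rollouts)
    ]
    return sum(scores) / num_rollouts                     # R(s_t, a_t)
```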
2. Addressing Exploration and Data Expansion Challenges
Traditional RAG approaches, such as outcome-supervised RL (e.g., Search-R1), suffer from inefficient exploration and sparse, global feedback: the model generates a complete reasoning chain before receiving a single outcome-based reward, which provides little guidance for intermediate decisions.
DecEx-RAG explicitly decomposes reasoning into local decisions and sub-executions, assigning intermediate rewards at each decision step. This structure enables:
- Granular supervision for each reasoning transition.
- More frequent and informative feedback, promoting accelerated and focused exploration during training.
To mitigate the exponential search-space growth common in sequential reasoning tasks, DecEx-RAG applies a dynamic branch-pruning strategy during search-tree expansion:
- At each node, multiple candidate sub-questions or retrieval decisions are simulated via rollouts.
- The branch with the highest aggregated reward is selected for further expansion.
- Redundant or low-reward branches are pruned—reducing data expansion complexity from exponential to linear relative to tree depth.
Empirical results show that data construction efficiency improves by nearly 6×, while data quality is maintained or enhanced.
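The pruning loop described above can be illustrated with the following sketch; `propose_candidates`, `apply_action`, and `step_reward` are assumed helpers standing in for candidate generation, state transition, and the rollout-based reward. Lower-reward siblings are retained only as preference data for the DPO stage discussed in the next section.

```python
# Illustrative sketch (under assumed helpers, not the released implementation) of dynamic
# pruning during tree expansion: every candidate branch at a node is scored with the
# rollout-based reward, only the best branch is expanded, and lower-reward siblings are kept
# as preference data, so expanded nodes grow linearly rather than exponentially with depth.

def expand_with_pruning(root_state, propose_candidates, apply_action, step_reward, max_depth=6):
    path, state = [], root_state
    for _ in range(max_depth):
        candidates = propose_candidates(state)            # candidate sub-questions / retrieval decisions
        if not candidates:
            break
        scored = [(step_reward(state, a), a) for a in candidates]
        best_reward, best_action = max(scored, key=lambda item: item[0])
        path.append({"state": state, "action": best_action,
                     "reward": best_reward, "siblings": scored})
        if best_action.get("terminate"):                  # stop once the policy signals termination
            break
        state = apply_action(state, best_action)          # expand only the highest-reward branch
    return path
```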
3. Two-Stage Optimization Procedure
The training of DecEx-RAG involves a two-stage process:
- Supervised Fine-Tuning (SFT): Each candidate reasoning chain from root to leaf (i.e., the full decomposition sequence) is used to train the model on optimal action sequences.
- Direct Preference Optimization (DPO): All decision branches in the search tree, including suboptimal and preferred actions, are used for preference-based policy optimization.
The DPO objective is formulated as:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$
Here, $\pi_\theta$ is the policy under training, $\pi_{\mathrm{ref}}$ is a reference policy, $\beta$ controls the KL-divergence constraint, and $\sigma$ is the logistic function; $y_w$ and $y_l$ denote the preferred and dispreferred branches for a given state $x$. Preference data is constructed from branches evaluated during search, allowing fine-grained discrimination between high- and low-quality reasoning steps.
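For concreteness, a minimal PyTorch sketch of this preference loss, assuming sequence-level log-probabilities have already been computed for the preferred and dispreferred branches under both the trained policy and the frozen reference policy; this mirrors the standard DPO objective rather than any DecEx-RAG-specific code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta / pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta / pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x), the numerically stable form of the DPO loss
    return F.softplus(-logits).mean()
```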
4. Empirical Performance and Evaluation
DecEx-RAG has been benchmarked on six open-domain QA datasets, demonstrating robust, cross-domain improvements. Reported gains include:
- An average absolute performance improvement over prompt-based and outcome-supervised RL baselines.
- Consistent outperformance irrespective of domain or dataset composition, substantiating that process-level policy optimization generalizes well.
The pruning strategy achieves nearly 6× efficiency gains in data construction, which is especially pertinent for large-scale RAG applications that otherwise suffer from massive search-expansion overhead.
5. Practical and Methodological Implications
DecEx-RAG supports fine-grained, transparent, and process-supervised reasoning, making it suited for complex, multi-hop tasks and domains demanding robust logic substantiation (e.g., open-domain QA, scientific reasoning, autonomous decision workflows). The step-level supervision yields interpretable intermediate steps, each validated by either internal knowledge or external retrieval.
Linear-time search expansion through dynamic pruning enhances practicality, allowing DecEx-RAG to scale to real-world training and deployment scenarios without prohibitive resource consumption.
6. Future Directions
Several areas for future investigation are highlighted:
- The development of more precise evaluation metrics tailored to multi-step reasoning, as standard token-level EM and F1 may not adequately capture the fidelity of intermediate processes.
- Refinement of reward functions to provide more authoritative and context-sensitive supervision without an increase in computational complexity.
- Integration of process-supervised signals in broader reinforcement learning paradigms, potentially generalizing DecEx-RAG concepts to new application domains beyond question answering.
A plausible implication is that incorporating process-level supervision and efficient combinatorial pruning, as in DecEx-RAG, can facilitate the transition toward agentic retrieval-augmented LLM systems that are less prone to hallucination, more scalable, and better suited to domains requiring complex, multi-stage reasoning chains.
7. Summary Table: DecEx-RAG Components
| Component | Description | Impact |
|---|---|---|
| MDP-based Decomposition | Separates decision-making and execution steps | Granular supervision |
| Intermediate Reward | Step-level correctness evaluation | Improved exploration |
| Dynamic Pruning Strategy | Retains high-reward branches, prunes redundant paths | Linear expansion |
| Two-stage Training | SFT for optimal chains; DPO for preference learning | Robust optimization |
| Cross-domain Evaluation | Tested on 6 QA datasets with strong empirical gains | Generalizability |
DecEx-RAG is defined by its process-level MDP framework, intermediate rewards, pruning-based efficiency enhancement, and robust cross-domain QA performance. These properties substantiate its position as a leading agentic RAG methodology for reason-aware LLM augmentation (Leng et al., 7 Oct 2025).