
Sub-Goal Verifiable Reward (SGVR) Framework

Updated 15 January 2026
  • SGVR is a reinforcement learning framework that decomposes complex tasks into independently verifiable sub-goals for dense and transparent feedback.
  • It uses specialized verifiers—ranging from multimodal classifiers to unit test executions—to assess each sub-goal and aggregate reliable reward signals.
  • Empirical studies show SGVR improves performance in domains like multimodal reasoning, geometric proofs, and code generation, while addressing reward hacking vulnerabilities.

The Sub-Goal Verifiable Reward (SGVR) Framework is a general methodology for reinforcement learning (RL) and agent training that decomposes complex tasks into verifiable sub-goals, allowing reward signals to be assigned based on independent, auditable evidence for each intermediate milestone or answer component. This approach is motivated by the inadequacy of end-to-end (outcome-based) reward methods, which impede robust learning in reasoning, multimodal, tool-using, or code-generation settings due to sparsity, misalignment, and reward hacking vulnerabilities. SGVR frameworks systematically extract, verify, and aggregate fine-grained signals, enabling dense, reliable supervision via explicit sub-goal success. This paradigm has been instantiated in diverse domains including multimodal reasoning, geometric proof, long-form retrieval-augmented generation, code synthesis, and agent skill acquisition.

1. Formal Structure of Sub-Goal Verifiable Reward

SGVR relies on a decomposition of the agent’s output or trajectory into a set of sub-goals—atomic components (blanks, proof steps, nuggets, unit tests, skills) with deterministic, independently checkable verification criteria. Each sub-goal corresponds to a unique, reconstructible evidence item, and the SGVR reward aggregates component-wise scores to a structured or scalar feedback signal.
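A minimal sketch of this decomposition, assuming a code-style task whose sub-goals are deterministic checks (the `SubGoal` structure and names are illustrative, not from the cited papers):

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class SubGoal:
    """One atomic, independently checkable component of a task."""
    name: str
    check: Callable[[Any], bool]  # deterministic verification criterion

# Illustrative example: decompose a "write an add function" task
# into unit-test-style sub-goals.
sub_goals = [
    SubGoal("returns_sum", lambda f: f(1, 2) == 3),
    SubGoal("handles_negatives", lambda f: f(-1, 1) == 0),
]

def verify(candidate: Any, goals: List[SubGoal]) -> List[int]:
    """Score each sub-goal independently; 1 = verified, 0 = failed."""
    return [int(g.check(candidate)) for g in goals]
```

Each entry of the returned vector is auditable on its own, which is the property the aggregation formulas below rely on.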

Let $a$ denote the agent's answer, $G = \{g_1, \ldots, g_J\}$ the set of sub-goals, and $y$ the ground-truth answer annotated at the sub-goal level. The SGVR verifier $f_\theta(a, y)$ outputs a vector $s = [s_1, \ldots, s_J]$, $s_j \in \{0,1\}^{m_j}$, encoding the binary (or graded) correctness of each blank or component in sub-goal $g_j$ (Zhang et al., 7 Aug 2025, Chen et al., 8 Jan 2026, Ma et al., 16 Oct 2025, Wang et al., 7 Jan 2026).

A scalar reward is derived as

$$R_{\text{SGVR}}(a, y) = \frac{1}{K} \sum_{j=1}^J \operatorname{mean}(s_j), \qquad K = \sum_{j=1}^J m_j,$$

or, more generally,

$$R_\phi(q, \hat{y}) = \frac{\sum_{i=1}^{k} w_i\, V_\varphi(q, \hat{y}, r_i)}{\sum_{j=1}^k w_j},$$

where $V_\varphi(q, \hat{y}, r_i)$ is a learned verifiable score per rubric $r_i$ (e.g., "nugget-as-rubric" for long-form QA) and $w_i$ are rubric weights (Ma et al., 16 Oct 2025).
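The two aggregation rules above can be sketched directly; this is a minimal pure-Python rendering of the document's formulas, with binary per-component scores assumed:

```python
def sgvr_reward(sub_goal_scores):
    """Scalar SGVR reward: per-sub-goal means summed, normalized by the
    total component count K, following the formula in the text."""
    K = sum(len(s) for s in sub_goal_scores)
    return sum(sum(s) / len(s) for s in sub_goal_scores) / K

def weighted_rubric_reward(verifier_scores, weights):
    """Weighted generalization: sum_i w_i * V_i / sum_j w_j."""
    num = sum(w * v for w, v in zip(weights, verifier_scores))
    return num / sum(weights)
```

For example, two sub-goals scored `[1, 1]` and `[0, 1]` give `K = 4` and per-sub-goal means `1.0` and `0.5`, hence a reward of `1.5 / 4 = 0.375` under the first rule.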

The formal structure extends to multi-turn trajectories or skills in agentic systems, where the total reward is a sum over sub-goal verifications with evidence bundles:

$$R(\tau) = \sum_{j=1}^{J} v(g_j, e_j),$$

where $v(g_j, e_j) \in \{0, 1\}$ is the verification outcome for sub-goal $g_j$ given its recorded evidence bundle $e_j$.

This evidence-centric decomposition enables replayable, auditable, and security-hardened reward signal construction (Huang et al., 28 Dec 2025).
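One way such replayable evidence could be implemented is an HMAC-signed record per verification; this is a hypothetical sketch (the field names, signing scheme, and key handling are assumptions, not details from the cited work, which places the key inside an enclave):

```python
import hashlib
import hmac
import json

SECRET = b"replace-with-enclave-held-key"  # hypothetical signing key

def evidence_bundle(sub_goal_id, inputs, verdict):
    """Record one sub-goal verification with a signature so the reward
    component can be replayed and audited offline."""
    record = {"sub_goal": sub_goal_id, "inputs": inputs, "verdict": verdict}
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return record

def verify_bundle(record):
    """Recompute the signature over everything except the stored one."""
    body = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])
```

Any post-hoc tampering with a verdict invalidates the signature, so aggregated rewards remain reconstructible from the stored bundles.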

2. Verification Mechanisms and Architectures

SGVR requires a sub-goal verifier capable of evaluating, with high reliability, the correctness or support for each atomic sub-goal:

  • Multimodal Reasoning: The verifier is a model that consumes vision features and text embeddings; after fusion, per-sub-goal heads classify the correctness of each blank (Zhang et al., 7 Aug 2025).
  • Proof/Mathematical Reasoning: Each sub-goal is a numeric checkpoint extracted from formal proof skeletons; correctness is evaluated as exact or tolerance-based numeric match (Chen et al., 8 Jan 2026).
  • Retrieval-Augmented Generation: For each nugget-based rubric, a generative verifier outputs ternary support labels (supported / partially supported / not supported), aggregated across blocks; models such as Search-Gen-V (a Qwen3-4B-Instruct backbone) are trained for this role with distillation and RL (Ma et al., 16 Oct 2025).
  • Code Generation: Each unit test forms a sub-goal; verification is done via test execution, with pass/fail outcomes. Weights are dynamically estimated from pass rates and density-normalized (Wang et al., 7 Jan 2026).
  • Agentic Skills: Each skill is accompanied by pre/postcondition contracts. Replay-based verification is conducted over held-out suites, registering signed evidence bundles to ensure skill validity (Huang et al., 28 Dec 2025).

All frameworks rely on deterministic, replayable evaluation functions to enable offline or post-hoc reward auditing.
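As a concrete instance of the code-generation verifier above, the sketch below runs one unit test per sub-goal in a fresh interpreter, yielding a deterministic pass/fail signal. It is a simplification under stated assumptions: sandboxing, resource limits, and the paper's pass-rate weighting are elided, and the function names are illustrative.

```python
import os
import subprocess
import sys
import tempfile

def run_unit_test(candidate_code, test_snippet, timeout=5):
    """Verify one sub-goal: execute a single unit test against the
    candidate program in a fresh interpreter. Returns True on pass."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", delete=False
    ) as f:
        f.write(candidate_code + "\n" + test_snippet + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0  # pass/fail is the sub-goal score
    except subprocess.TimeoutExpired:
        return False  # non-terminating candidates fail the sub-goal
    finally:
        os.unlink(path)
```

Because each test is an isolated process with a recorded exit code, the same rollout can be re-verified after the fact, matching the auditability requirement.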

3. Reward Aggregation and RL Optimization

The per-rollout or per-trajectory reward is a (weighted) sum or average over verified sub-goals. To maximize robust learning, RL algorithms are adapted:

  • Policy Optimization: SGVR is deployed with PPO or Group Relative Policy Optimization (GRPO), where groups of rollouts per prompt are scored, mean-normalized, and used in clipped-importance-sampling objectives. For example, in code generation:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_1, \ldots, R_G\})}{\operatorname{std}(\{R_1, \ldots, R_G\})},$$

with a mixing coefficient blending these group-normalized (global) advantages with turn-level advantages (Wang et al., 7 Jan 2026).

  • Skill Graph Environments: In agent self-improvement, each candidate skill's promotion and reward eligibility depend on passing signed contract checks and accumulating replayable evidence, with periodic adversarial stress testing and bounded-memory discipline (Huang et al., 28 Dec 2025).
  • Hybrid and Hierarchical Tasks: For reasoning chains or table reconstructions, reward terms may target schema compliance, intermediate reconstructions, stepwise process conformity, and final answers, each admitted as a verifiable component (Sinha et al., 13 Oct 2025, Zhang et al., 7 Aug 2025).
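The group-relative normalization used in GRPO-style optimization can be sketched as follows; the mixing coefficient `lam` is an assumption for illustration, as the source does not state its value:

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Standard GRPO step: normalize each rollout's reward against the
    mean and standard deviation of its prompt group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]

def mixed_advantage(global_adv, turn_adv, lam=0.5):
    """Blend global and turn-level advantages; lam is an assumed
    mixing weight, not a value from the cited paper."""
    return lam * global_adv + (1 - lam) * turn_adv
```

Group normalization is what makes dense SGVR rewards usable here: a group whose rollouts all receive identical rewards contributes zero advantage, which is why degenerate-group ratios matter in the evaluations below.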

4. Domain-Specific Instantiations and Experimental Evaluations

Table: Representative SGVR Applications

| Domain | Sub-goal Extraction | Verification Signal |
|---|---|---|
| Multimodal VQA | Fill-in blanks, multi-steps | Model-based, semantic + numeric |
| Geometry Proof | Formal proof skeleton steps | Numeric equality / checkpoints |
| Retrieval QA | Nugget / rubric mining | Ternary generative / verifiable |
| Code Generation | Unit test decomposition | Test execution, weighted |
| Agent Skills | Skill nodes / edges in graph | Interface contract checks |

  • Multimodal Reasoning: Seed-StructVRM achieved state-of-the-art on 6/12 public VQA-style benchmarks and a curated STEM-Bench, with improvements of +3.72% absolute on total score and up to +8% per domain (Zhang et al., 7 Aug 2025).
  • Geometry/Math: SGVR yielded +9.7% accuracy in geometric reasoning, +8.0% in general math, +2.8% in general reasoning benchmarks versus pretrained models. Gains were attributed to the Skeleton Rate providing dense, aligned supervision (Chen et al., 8 Jan 2026).
  • Retrieval QA: Rubric-based SGVR delivered rubric-level and sample-level F1 scores near those of much larger oracle models, while boosting correlation with independent comprehensiveness metrics; verification efficiency also improved substantially at the scale of 1,000 rubrics (Ma et al., 16 Oct 2025).
  • Code Generation: VeRPO’s SGVR offered up to +8.83% absolute gain in pass@1 over outcome-based or reward-model baselines, and markedly reduced the ratio of degenerate groups, i.e., groups that yield no learning signal (Wang et al., 7 Jan 2026).
  • Agentic LLMs/Skills: Audited skill-graph self-improvement leverages SGVR to gate agent improvement with evidence-backed reward decomposition, addressing reward hacking, behavioral drift, and compliance logging (Huang et al., 28 Dec 2025).

5. Limitations, Security, and Future Extensions

While SGVR corrects key weaknesses of outcome-only supervision, several limitations persist:

  • Verifier Reliance: Performance is bottlenecked by the accuracy and coverage of the verifier (semantic, numeric, rubric-based, or tool-execution)—bias or gaps in the verifier propagate directly to SGVR reward fidelity (Zhang et al., 7 Aug 2025, Ma et al., 16 Oct 2025).
  • Sub-goal Formulation: In domains such as non-numeric proofs or diagrammatic reasoning, sub-goal decomposition and reliable mapping from abstract predicates to verifiable tests remain unresolved (Chen et al., 8 Jan 2026).
  • Sparse Sub-goal Regimes: Very short test suites or homogeneous sub-task difficulties can degrade SGVR robustness, suggesting the need for adaptive weighting or hybrid reward designs (Wang et al., 7 Jan 2026).
  • Security/Robustness: Threat models assume append-only logging, enclave-based verifier isolation, cryptographic signing, and deterministic replay for evidence—security guarantees depend on rigorous enforcement of boundaries and transparency (Huang et al., 28 Dec 2025).

Proposed extensions include integrating proof assistant outputs as sub-goal sources, hybridizing with human feedback for coverage of non-automatable tasks, and expanding SGVR to new domains such as planning and code synthesis with compositional tool checkers (Zhang et al., 7 Aug 2025, Chen et al., 8 Jan 2026).

6. Significance and Broader Impact

SGVR reframes reward modeling as the aggregation of independently verifiable intermediate achievements, bridging the gap between coarse supervised approaches and opaque, model-based reward functions. Empirically, SGVR yields:

  • Dense, low-variance, and interpretable feedback, reducing overfitting and improving out-of-distribution robustness (Sinha et al., 13 Oct 2025).
  • Stronger, more faithful chain-of-thought and process conformity, enhancing model trust and transparency (Ma et al., 16 Oct 2025).
  • Auditability and reproducibility in long-horizon agentic learning and self-improvement, since every reward component is reconstructible from signed evidence artifacts (Huang et al., 28 Dec 2025).
  • Broad transferability, with gains in cross-domain performance, as step-level reasoning competencies generalize beyond training targets (Chen et al., 8 Jan 2026).

SGVR thus establishes a new foundation for verifiable, modular, and governance-ready RL architectures in complex AI systems.
