Trustworthy Reasoning with Structured Facts

Updated 4 July 2026

TRSF is a paradigm that formalizes reasoning by converting LLM outputs into explicit, machine-readable structured facts with clear dependency and provenance.
It operationalizes verification through formal intermediates, using JSON/YAML representations and external provers to check that each reasoning step is logically valid.
TRSF has been applied in domains such as legal reasoning, clinical diagnostics, and multi-agent systems, with the goal of making automated reasoning auditable rather than merely accurate.

Trustworthy Reasoning with Structured Facts (TRSF) is a paradigm for getting LLMs to produce reasoning chains that are both correct and checkable, not just answers that happen to be right. In this paradigm, intermediate reasoning is externalized as explicit, machine-readable facts, links, or steps, and trustworthiness is tied to verification, auditability, provenance, or stage-wise evaluation rather than to final-answer accuracy alone. In the logical-reasoning setting, PRoSFI operationalizes this idea through structured formal intermediates that are verified by formal provers; in multi-agent settings, TRSF functions as a shared, provenance-aware fact substrate that continuously organizes, validates, and synchronizes evidence across agents (Chen et al., 31 Mar 2026, Zhang et al., 24 Oct 2025).

1. Problem formulation and intellectual rationale

The central motivation for TRSF is the mismatch between answer correctness and reasoning validity. In outcome-based RL, models are rewarded when the final answer is correct and the format looks good, but they are never directly penalized for incorrect or nonsensical intermediate steps as long as the final answer matches the target. In the PRoSFI study, the Outcome-CoT baseline on ProverQA-hard raises Answer Correct Rate from 46.0% to 91.31%, yet GPT-based Soundness is only 21.97%, so a large fraction of correct answers are backed by unsound or dubious reasoning (Chen et al., 31 Mar 2026).

This problem is usually expressed through two linked notions. Step-level correctness requires each inference step to be logically valid relative to its premises. Path-level soundness requires the entire reasoning chain to be a valid derivation of the conclusion from the premises. TRSF treats both as first-class targets rather than incidental by-products of answer optimization (Chen et al., 31 Mar 2026).
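The distinction between the two notions can be illustrated with a small Lean 4 sketch. The propositions `P`, `Q`, `R` and the theorem names are hypothetical and are not taken from PRoSFI; the sketch only shows that each step can be checked against its own premises, while path-level soundness additionally requires that the steps chain from the original premises to the final conclusion.

```lean
-- Step-level correctness: each inference is checked against only its own premises.
theorem step1 (P Q : Prop) (h₁ : P → Q) (h₂ : P) : Q := h₁ h₂

theorem step2 (Q R : Prop) (h₃ : Q → R) (hq : Q) : R := h₃ hq

-- Path-level soundness: the conclusion R is derived from the original premises
-- by chaining the two verified steps.
theorem path (P Q R : Prop) (h₁ : P → Q) (h₃ : Q → R) (h₂ : P) : R :=
  step2 Q R h₃ (step1 P Q h₁ h₂)
```

A chain can fail at the path level even when every step checks locally, for example if a step depends on a statement that was never derived from the premises.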

In agentic systems, the same concern appears as a state-management problem. Co-Sight defines TRSF as the “memory and evidence substrate” of the system: a shared facts module that continuously extracts, organizes, validates, and synchronizes evidence across agents, so that later reasoning is grounded in source-verified, traceable information rather than ephemeral free-form CoT (Zhang et al., 24 Oct 2025). A plausible implication is that TRSF is not a single algorithm but a design principle: make the latent support structure of reasoning explicit, typed, and auditable.

2. Structured facts as the core representation

TRSF systems replace opaque intermediate text with structured objects that expose dependencies, provenance, and local inferential commitments. In PRoSFI, each reasoning step is a record with four fields—id, dependencies, conclusion, and rule—and the model outputs a JSON or YAML sequence inside <summary>...</summary>. Each step is atomic, each uses fewer than 5 dependencies, and the last step’s id and conclusion must match one answer option. The resulting JSON array is a small proof graph or DAG of structured facts (Chen et al., 31 Mar 2026).
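The structural constraints described above can be sketched as a small validator. This is a minimal illustration, not PRoSFI's implementation: the field names come from the text, but the helper names, the premise ids (`p1`, `p2`, `p3`), and the assumption that steps are listed in dependency order are hypothetical. How the last step's id is matched against an answer option is not specified here, so the sketch checks only the conclusion.

```python
import json
from dataclasses import dataclass

MAX_DEPS = 4  # the text requires fewer than 5 dependencies per step


@dataclass(frozen=True)
class Step:
    id: str
    dependencies: tuple
    conclusion: str
    rule: str


def parse_steps(raw: str) -> list:
    """Parse a JSON array of step records into Step objects."""
    return [
        Step(s["id"], tuple(s["dependencies"]), s["conclusion"], s["rule"])
        for s in json.loads(raw)
    ]


def check_structure(steps, premise_ids, options) -> list:
    """Return structural problems; an empty list means the chain is well-formed."""
    problems = []
    known = set(premise_ids)  # ids a later step is allowed to depend on
    for step in steps:
        if step.id in known:
            problems.append(f"{step.id}: duplicate id")
        if len(step.dependencies) > MAX_DEPS:
            problems.append(f"{step.id}: more than {MAX_DEPS} dependencies")
        for dep in step.dependencies:
            if dep not in known:
                # Assumption: only earlier ids are allowed, which keeps the graph acyclic.
                problems.append(f"{step.id}: unknown or forward dependency {dep}")
        known.add(step.id)
    if not steps or steps[-1].conclusion not in options:
        problems.append("final conclusion does not match an answer option")
    return problems


raw = """[
  {"id": "s1", "dependencies": ["p1", "p2"], "conclusion": "Q", "rule": "modus ponens"},
  {"id": "s2", "dependencies": ["s1", "p3"], "conclusion": "R", "rule": "modus ponens"}
]"""
steps = parse_steps(raw)
problems = check_structure(steps, {"p1", "p2", "p3"}, {"R", "not R"})
print(problems)  # prints [] for this chain
```

A structural check of this kind only establishes that the output is a well-formed proof graph; whether each step is logically valid is a separate question for the prover.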

In Co-Sight, the structured fact store is typed at the memory level rather than the proof-step level. Facts are organized as given facts, retrieved facts, derived facts, and assumptions, and they are maintained through a three-tier compression pipeline: tool level for raw tool metadata, notes level for concise summaries and uncertainty, and facts level for stable, verified knowledge reused across agents (Zhang et al., 24 Oct 2025). This yields a provenance-aware blackboard on which verification and conflict resolution can operate.
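A typed, provenance-aware fact store of this kind might look roughly like the sketch below. The four fact kinds follow the text; everything else (class names, the `source` and `derived_from` fields, the rule that retrieved facts need a source) is a hypothetical design choice for illustration, and the three-tier compression pipeline is not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class FactKind(Enum):
    GIVEN = "given"
    RETRIEVED = "retrieved"
    DERIVED = "derived"
    ASSUMPTION = "assumption"


@dataclass
class Fact:
    fact_id: str
    kind: FactKind
    statement: str
    source: str = ""          # provenance: URL, tool call, or agent name (assumed field)
    derived_from: tuple = ()  # ids of the facts this one rests on (assumed field)


class FactStore:
    def __init__(self):
        self._facts = {}

    def add(self, fact: Fact) -> None:
        if fact.kind is FactKind.RETRIEVED and not fact.source:
            raise ValueError("retrieved facts need a source")
        if fact.kind is FactKind.DERIVED:
            missing = [d for d in fact.derived_from if d not in self._facts]
            if missing or not fact.derived_from:
                raise ValueError(f"derived fact lacks known support: {missing}")
        self._facts[fact.fact_id] = fact

    def provenance(self, fact_id: str) -> list:
        """Return ids of all facts that fact_id transitively rests on."""
        seen = []
        stack = list(self._facts[fact_id].derived_from)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self._facts[current].derived_from)
        return seen

    def rests_on_assumption(self, fact_id: str) -> bool:
        ids = [fact_id] + self.provenance(fact_id)
        return any(self._facts[i].kind is FactKind.ASSUMPTION for i in ids)


# Placeholder statements, used only to exercise the store.
store = FactStore()
store.add(Fact("f1", FactKind.GIVEN, "The task asks for a yearly total."))
store.add(Fact("f2", FactKind.RETRIEVED, "The report lists the total.",
               source="https://example.com/report"))
store.add(Fact("a1", FactKind.ASSUMPTION, "The report uses the expected unit."))
store.add(Fact("d1", FactKind.DERIVED, "Answer with unit.", derived_from=("f2", "a1")))
store.add(Fact("d2", FactKind.DERIVED, "Answer without unit.", derived_from=("f1", "f2")))
```

Keeping assumptions as their own kind lets a downstream agent ask whether a conclusion ultimately rests on something unverified, which is the practical payoff of typing facts at the memory level.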

A legal-domain instantiation replaces proof steps with ontology-linked legal reasoning nodes. The legal knowledge graph contains Fact, Provision, LegalNorm, and LegalApplication, with the central path Provision → LegalNorm → LegalApplication ← Fact. LegalApplication is the explicit reasoning step that applies a legal norm to a fact, making the latent connection from law to case facts inspectable and machine-readable (Kondo et al., 24 Aug 2025). Across these variants, the recurring TRSF pattern is explicit dependency structure: facts are not only stored, but linked in a way that preserves inferential roles.
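The central path can be made concrete with a small typed graph. This is a sketch under stated assumptions: the four node types and three edge types come from the text, while the `LegalGraph` class, its methods, and the node ids are hypothetical.

```python
# Edge types on the central path Provision -> LegalNorm -> LegalApplication <- Fact.
ALLOWED_EDGES = {
    ("Provision", "LegalNorm"),
    ("LegalNorm", "LegalApplication"),
    ("Fact", "LegalApplication"),
}


class LegalGraph:
    def __init__(self):
        self.node_type = {}  # node id -> type name
        self.incoming = {}   # node id -> list of source node ids

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_type[node_id] = node_type
        self.incoming[node_id] = []

    def add_edge(self, src: str, dst: str) -> None:
        pair = (self.node_type[src], self.node_type[dst])
        if pair not in ALLOWED_EDGES:
            raise ValueError(f"edge {pair[0]} -> {pair[1]} is not in the schema")
        self.incoming[dst].append(src)

    def explain(self, application_id: str) -> dict:
        """Collect the facts, norms, and provisions behind one LegalApplication."""
        sources = self.incoming[application_id]
        norms = [n for n in sources if self.node_type[n] == "LegalNorm"]
        facts = [n for n in sources if self.node_type[n] == "Fact"]
        provisions = [p for n in norms for p in self.incoming[n]]
        return {"facts": facts, "norms": norms, "provisions": provisions}


graph = LegalGraph()
graph.add_node("prov1", "Provision")
graph.add_node("norm1", "LegalNorm")
graph.add_node("fact1", "Fact")
graph.add_node("app1", "LegalApplication")
graph.add_edge("prov1", "norm1")
graph.add_edge("norm1", "app1")
graph.add_edge("fact1", "app1")
explanation = graph.explain("app1")
```

Because every LegalApplication node has typed incoming edges, the question of which provision and which fact justify a given reasoning step reduces to a graph lookup.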

3. Verification, reward shaping, and executable reasoning

A defining feature of TRSF is that structured facts are not merely explanatory artifacts; they are inputs to verification or execution procedures. In PRoSFI, each structured step $s$ is converted into a prover query of the form $\Gamma_s \vdash \varphi_s$, where dependencies(s) supplies $\Gamma_s$ and conclusion(s) supplies $\varphi_s$. Lean 4, Z3, and Prover9 are used as backends. The reward is trajectory-level but verification-sensitive:

$$
R = \begin{cases}
1.0, & \text{answer correct and all steps verified} \\
0.3, & \text{answer correct, but some steps not verified} \\
0.1, & \text{format correct, but answer incorrect} \\
0.0, & \text{output format incorrect or other failure cases}
\end{cases}
$$

This separates correct-but-unverified chains from fully verified chains and aligns GRPO with globally verified reasoning rather than answer-only success [2603.29500].

TRUE extends the same logic from formal proofs to executable process specifications. It defines an explanation as a structured sequence $E=(e_1,\dots,e_T)$ of executable reasoning steps, and it calls an explanation trustworthy if an independent blind executor $V$, given only $E$ and not the original problem, can recover the correct answer, $\hat y = V(E)$. TRUE then builds feasible-region DAGs $G=(S,E)$ over perturbation neighborhoods and performs class-level causal failure mode analysis with Shapley values, turning reasoning trustworthiness into a multi-level object: executable at the instance level, structurally stable in a local neighborhood, and diagnostically analyzable at the class level [2602.18905].

Conflict-aware RAG provides an adjacent formulation. The reasoning-trace-augmented RAG framework decomposes reasoning into document-level adjudication, conflict analysis, and grounded synthesis, producing citation-linked answers or the exact refusal string `CANNOT ANSWER, INSUFFICIENT EVIDENCE`. Its Conflict-Aware Trust-Score (CATS) evaluates grounded refusal, answer correctness, grounded citation, and behavioral adherence under explicit conflict types such as outdated information, misinformation, and conflicting opinions or research outcomes [2512.16795]. In TRSF terms, this is verification over retrieved fact clusters rather than over theorem-prover sequents.

4. Cross-domain instantiations

TRSF has been instantiated across multiple domains, with the structured-fact substrate changing by modality while the trust objective remains stable.

| Paper | Domain | Structured facts and verification target |
|---|---|---|
| "RSAT: Structured Attribution Makes Small LLMs Faithful Table Reasoners" [2605.00199] | Table QA | `reasoning_steps` with `cited_cells`; NLI-based faithfulness and citation validity |
| "CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs" [2606.27264] | 3D chest CT | Four-stage diagnostic trace: task understanding, visual observation, diagnostic reasoning, answer synthesis |
| "Structured Causal Video Reasoning via Multi-Objective Alignment" [2604.04415] | Video reasoning | Structured Event Facts with time, person, action, scene, object, camera, event caption |
| "DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding" [2605.08888] | Long multimodal documents | Evidence pages, evidence regions, factual statements, final answer |
| "From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning" [2604.11137] | Clinical diagnosis | Toulmin-structured argument $A=\{D,R,W,B,Q,Y\}$ |

In tables, RSAT uses a strict JSON schema in which each natural-language step cites specific table cells by coordinates. GRPO optimizes a composite reward centered on NLI-based faithfulness, citation validity, and parsimony. Across six SLMs, faithfulness improves from 0.224 to 0.826 on average, citation validity reaches 0.992, and removing the faithfulness reward collapses faithfulness from 0.97 to 0.03 in ablations [2605.00199].

In 3D radiology, CORTEX reconstructs the missing diagnostic process as a four-stage trace mirroring a radiologist’s workflow. It supplies 76,177 validated reasoning traces over CT-RATE and evaluates them with five rubric scores (task understanding, observation fidelity, hypothesis evaluation, reasoning logic, and answer correctness) plus expert radiologist review, with 93% inter-rater agreement on correctness [2606.27264]. Here the structured facts are linguistically organized by anatomy and hypothesis–evidence relations rather than by symbolic formulas.

In video reasoning, Structured Event Facts serve as an explicit prior before reasoning begins. Factum-4B first emits a dedicated block of event-level records with timestamps and event attributes, then performs a separate reasoning phase with “Global Search”, “Causal Verification”, and “Final Alignment”. During RL, format, task reward, and length reward are optimized as a multi-objective problem via Pareto-Frontier guided Advantage Balancing rather than a single scalarized GRPO reward [2604.04415].

In long-document QA, DocScope treats trustworthy reasoning as prediction of a trajectory $y=(\mathcal{P}, \mathcal{R}, \mathcal{F}, a)$, where $\mathcal{P}$ is the set of evidence pages, $\mathcal{R}$ the evidence regions, $\mathcal{F}$ the factual statements, and $a$ is the answer. The benchmark shows that even among correct answers, the highest observed rate of complete evidence chains is only 29%, and region grounding is the weakest stage across models (Feng et al., 9 May 2026).

Clinical diagnosis introduces argumentative structure. CGCL models the diagnostic argument as $A=\{D,R,W,B,Q,Y\}$, where $D$ is clinical evidence, $R$ is differential diagnosis, $W$ is pathophysiological rationale, $B$ is principled justification, $Q$ is certainty assessment, and $Y$ is the final diagnosis. The curriculum progresses from fact gathering and differential generation to justification and rebuttal, then to qualified conclusion (Zhan et al., 13 Apr 2026).
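For the table-QA case in this section, a cell-citation check can be sketched as follows. The field names `reasoning_steps` and `cited_cells` come from the schema described for RSAT; the zero-based `[row, column]` coordinate format, the `step` text key, and the definition of validity as the share of in-bounds citations are assumptions made for illustration and may differ from RSAT's actual metric.

```python
def citation_validity(output: dict, n_rows: int, n_cols: int) -> float:
    """Share of cited cells that point at a cell that exists in the table."""
    cells = [
        tuple(cell)
        for step in output["reasoning_steps"]
        for cell in step["cited_cells"]
    ]
    if not cells:
        return 0.0  # assumption: an output with no citations gets no credit
    valid = sum(1 for row, col in cells if 0 <= row < n_rows and 0 <= col < n_cols)
    return valid / len(cells)


# Hypothetical model output over a table with 3 rows and 2 columns.
output = {
    "reasoning_steps": [
        {"step": "Read the value in the first row.", "cited_cells": [[0, 1]]},
        {"step": "Compare it with rows two and three.", "cited_cells": [[1, 1], [2, 1]]},
        {"step": "Cite a cell outside the table.", "cited_cells": [[5, 0]]},
    ]
}
score = citation_validity(output, n_rows=3, n_cols=2)
print(score)  # prints 0.75
```

A bounds check of this kind says nothing about whether the cited cell supports the step, which is why the text pairs citation validity with an NLI-based faithfulness reward.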

5. Evaluation of trustworthiness

TRSF requires metrics that separate answer quality from process quality. PRoSFI uses three core metrics: Answer Correct Rate; Reward Hit Rate, the fraction of outputs that receive the maximum reward $R = 1.0$; and Soundness, defined as a correct answer plus a valid reasoning path under an LLM judge. Its main empirical claim is that Reward Hit correlates much more strongly with Soundness than Answer Correctness does, making fully verified chains a stronger proxy for trustworthy reasoning than correct answers alone (Chen et al., 31 Mar 2026).
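The relationship between the four-tier PRoSFI reward from Section 3 and these two rates can be sketched in a few lines. The reward values follow the text; the function names, the boolean rollout encoding, and the choice to count an answer as correct only when the format is also valid are assumptions for illustration.

```python
def prosfi_reward(format_ok: bool, answer_ok: bool, all_steps_verified: bool) -> float:
    """Four-tier trajectory reward: 1.0 / 0.3 / 0.1 / 0.0."""
    if not format_ok:
        return 0.0
    if not answer_ok:
        return 0.1
    return 1.0 if all_steps_verified else 0.3


def answer_correct_rate(rollouts) -> float:
    # Assumption: an answer can only be scored when the format is valid.
    return sum(1 for f, a, _ in rollouts if f and a) / len(rollouts)


def reward_hit_rate(rollouts) -> float:
    """Fraction of rollouts that reach the maximum reward of 1.0."""
    return sum(1 for r in rollouts if prosfi_reward(*r) == 1.0) / len(rollouts)


# Each rollout is (format_ok, answer_ok, all_steps_verified); values are made up.
rollouts = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, False, False),
]
rewards = [prosfi_reward(*r) for r in rollouts]
print(rewards)                        # prints [1.0, 0.3, 0.1, 0.0]
print(answer_correct_rate(rollouts))  # prints 0.5
print(reward_hit_rate(rollouts))      # prints 0.25
```

The gap between the two rates in this toy batch is the quantity of interest: it counts answers that are correct but not backed by a fully verified chain.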

TRUE pushes evaluation beyond single instances. It defines Executable Accuracy, Original Accuracy, Executable Consistency, and Executable Recovery Rate, then studies local robustness through feasible-region DAGs whose step weights combine semantic correctness with the blind-execution success rate across a perturbation neighborhood. At the class level it evaluates the stability of discovered failure modes with Jaccard similarity and Kendall’s $\tau$ (Yang, 21 Feb 2026). This yields a trustworthiness profile that includes operational validity, local stability, and recurring structural failure patterns.
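The two stability statistics named above are standard and can be computed without external libraries. The sketch below assumes, hypothetically, that two analysis runs each yield a set of failure-mode labels and a list of importance scores over the same modes; what TRUE actually compares is not specified here. The tau variant shown is tau-a, with no tie correction.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)


def kendall_tau(x, y) -> float:
    """Kendall's tau-a for two equal-length score lists (no tie correction)."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical failure-mode labels from two runs.
run_a = {"a", "b", "c"}
run_b = {"b", "c", "d"}
overlap = jaccard(run_a, run_b)
print(overlap)  # prints 0.5

# Hypothetical importance scores for the same four modes in two runs.
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

Jaccard similarity measures whether the same failure modes are discovered at all, while Kendall's tau measures whether they are ranked in the same order, so the two statistics answer different stability questions.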

DocScope adds hierarchical trajectory evaluation with inter-stage decoupling. Page Localization is evaluated first; Region Grounding and Fact Extraction are then evaluated only on correctly retrieved pages. Oracle studies further show that giving models oracle facts yields the largest gain, identifying faithful perception and fact extraction as the dominant capability bottleneck (Feng et al., 9 May 2026). A plausible implication is that in multimodal TRSF, fact extraction often dominates downstream reasoning quality more than the answer generator itself.
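Inter-stage decoupling can be illustrated with a two-stage scorer. This is a sketch, not DocScope's metric definitions: the use of page recall, the per-page boolean `region_hit` map, and the function name are assumptions. The point it shows is that region grounding is scored only on pages that were retrieved correctly, so a page-retrieval error is not counted again at the next stage.

```python
def decoupled_scores(pred_pages: set, gold_pages: set, region_hit: dict):
    """Score page localization, then region grounding on correct pages only.

    region_hit maps a page number to whether the predicted region on that
    page matches the gold region (assumed to be computed elsewhere).
    """
    correct_pages = pred_pages & gold_pages
    page_recall = len(correct_pages) / len(gold_pages) if gold_pages else 0.0
    if not correct_pages:
        return page_recall, None  # the region stage cannot be evaluated
    hits = sum(1 for page in correct_pages if region_hit.get(page, False))
    return page_recall, hits / len(correct_pages)


# Hypothetical prediction: page 1 is a false positive, page 7 is missed.
page_recall, region_acc = decoupled_scores(
    pred_pages={1, 2, 5},
    gold_pages={2, 5, 7},
    region_hit={1: True, 2: True, 5: False},
)
```

In this example the region entry for page 1 is ignored because page 1 is not an evidence page, so the region score reflects grounding quality alone.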

Conflict-aware RAG contributes a complementary evaluation axis. CATS scores grounded refusal, answer correctness, grounded citation, and behavioral adherence under conflict. In end-to-end mode, supervised fine-tuning on structured reasoning traces improves Qwen’s answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722, showing that conflict-aware structured reasoning is measurable and trainable rather than purely descriptive (Mishra et al., 18 Dec 2025).

6. Limitations, tensions, and research directions

Current TRSF systems remain constrained by representation coverage and domain scope. In PRoSFI, Reward Hit does not fully guarantee global soundness because the structured intermediates may not capture every nuance of the natural-language reasoning, and current experiments are in relatively controlled FOL settings such as ProverQA and Knights & Knaves (Chen et al., 31 Mar 2026). The legal knowledge graph notes a related structural gap: the current schema lacks explicit Norm→Norm and Fact→Fact edges, limiting representation of layered reasoning, supporting or constraining norms, and nested inferences (Kondo et al., 24 Aug 2025).

Verification pipelines also introduce their own bottlenecks. TRUE is computationally heavy because blind execution, perturbation generation, and Shapley-based failure analysis are expensive, and several meta-tasks still depend on LLM judgments (Yang, 21 Feb 2026). CORTEX relies on reports as proxies for images, uses an LLM judge despite clinician-designed rubrics, and does not yet show downstream gains from training models on the benchmark (Malik et al., 25 Jun 2026). RSAT uses the same NLI model for both training reward and evaluation, raising the risk that models may optimize for that evaluator rather than for human-judged grounding, even though ablations show the faithfulness reward is essential (Gajjar et al., 30 Apr 2026).

In multi-agent TRSF, synchronization and conflict resolution remain delicate. Co-Sight depends on precise alignment of disagreement points across reasoning traces, and the reliability of structured facts is bounded by the reliability of external tools and modalities (Zhang et al., 24 Oct 2025). More broadly, the literature repeatedly points toward the same next steps: richer domain-specific intermediate languages, hard constrained decoding or schema-based decoding, per-step rather than trajectory-only credit assignment, stronger autoformalization, uncertainty-aware facts, and tighter coupling between symbolic verifiers and neural generators. A plausible synthesis is that TRSF is evolving from a prompt-formatting technique into a broader neuro-symbolic regime in which evidence extraction, representation, verification, and answer synthesis are jointly engineered and jointly evaluated.