Co-Sight: Auditable Long-Horizon Reasoning

Updated 4 July 2026

Co-Sight is a framework for long-horizon reasoning in LLM agents that redefines inference by emphasizing falsification of intermediate steps.
It employs two key mechanisms—Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF)—to audit disagreements and secure verifiable evidence.
The approach optimizes computational resources by focusing verification on contentious reasoning steps, thereby enhancing accuracy and transparency.

Co-Sight is a framework for long-horizon reasoning in LLM-based agents that recasts inference as a falsifiable and auditable process rather than a purely generative one. Its central claim is that many failures in agentic reasoning arise because intermediate steps are insufficiently verified, provenance is lost, and tool traces become entangled with assumptions. The framework addresses this with two coupled mechanisms, Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF), organized as a closed verification loop in which expert agents propose candidate reasoning chains, a shared facts module maintains verifiable anchors, and a meta-verifier selectively tests only the conflicts that matter (Zhang et al., 24 Oct 2025).

1. Conceptual basis and problem formulation

Co-Sight begins from a specific diagnosis of long-horizon agent failure: errors propagate from early substeps, tool outputs and assumptions are mixed, and final answers become difficult to audit. In response, it shifts the operative objective from producing a single answer to proposing hypotheses and falsifying them. This suggests a methodological inversion relative to standard chain-of-thought systems: the critical resource is not unrestricted reasoning depth, but selective verification depth targeted at inconsistencies.

The framework is instantiated in a multi-agent setting. Let $N$ expert agents produce candidates $c_n$ with intermediate results and confidences. Reasoning steps form a DAG plan with step set $S$. For candidate $n$ and step $s \in S$, the intermediate result is $m_n(s)$, the local analysis is $h(m_n(s))$, and the calibrated confidence is $\sigma(m_n(s)) \in [0,1]$. Collecting all intermediates and confidences yields

$M = \{ m_n(s) : n \in [N], s \in S \}, \qquad \Gamma = \{ \sigma(m) : m \in M \}.$

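Co-Sight also assumes a constraint set $\mathcal{K}$ containing schemas, units, invariants, and consistency rules. Any result consistent with $\mathcal{K}$ belongs to

$\mathcal{Y}_{\mathcal{K}} = \{ y : y \models \mathcal{K} \}.$

After constraint pruning, the retained candidates are

$\mathcal{R} = \{ c_n : n \in [N],\ m_n(s) \models \mathcal{K} \text{ for all } s \in S \}.$

A central construct is the distinction between consensus anchors and conflicts. Recurring, cross-agent intermediates are promoted to anchors with threshold $\tau$:

$\mathcal{A} = \{ t \in \mathcal{T} : \operatorname{freq}(t) \ge \tau \}.$

Here $\mathcal{T}$ is the set of statements extracted from retained traces, and $\operatorname{freq}(t)$ measures how often $t$ recurs across the retained candidates. Disagreement points define the conflict set

$\mathcal{C} = \{ t \in \mathcal{T} : \exists\, t' \in \mathcal{T},\ t \perp_{\mathcal{K}} t' \},$

where $t \perp_{\mathcal{K}} t'$ denotes contradiction under $\mathcal{K}$. The final answer is then synthesized as a coherent result in $\mathcal{Y}_{\mathcal{K}}$ maximizing an integrated score over anchors, conflicts, confidences, and analyses:

$y^{*} = \arg\max_{y \in \mathcal{Y}_{\mathcal{K}}} \operatorname{Score}(y \mid \mathcal{A}, \mathcal{C}, \Gamma, H), \qquad H = \{ h(m) : m \in M \}.$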
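The pruning, anchoring, and conflict-detection steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the candidate representation (a mapping from step id to value), the `satisfies` constraint check, the use of a frequency fraction for the threshold `tau`, and the rule that two statements contradict when they assign different values to the same step are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable

# Hypothetical representation: one candidate maps step id -> intermediate result.
Candidate = dict[str, Hashable]
Statement = tuple[str, Hashable]


def prune(candidates: list[Candidate],
          satisfies: Callable[[str, Hashable], bool]) -> list[Candidate]:
    """Keep only candidates whose every intermediate passes the constraint check."""
    return [c for c in candidates if all(satisfies(s, m) for s, m in c.items())]


def anchors_and_conflicts(retained: list[Candidate],
                          tau: float) -> tuple[set[Statement], set[Statement]]:
    """Split (step, value) statements into consensus anchors and conflicts."""
    counts = Counter((s, m) for c in retained for s, m in c.items())
    n = len(retained)
    # Assumption: a statement is an anchor when the fraction of retained
    # candidates that produced it reaches the threshold tau.
    anchors = {stmt for stmt, k in counts.items() if n and k / n >= tau}
    # Assumption: statements conflict when the same step has several values.
    values_per_step: dict[str, set[Hashable]] = {}
    for step, value in counts:
        values_per_step.setdefault(step, set()).add(value)
    conflicts = {(s, m) for (s, m) in counts if len(values_per_step[s]) > 1}
    return anchors, conflicts


candidates = [
    {"distance_km": 10, "distance_mi": 6.2},
    {"distance_km": 10, "distance_mi": 16.1},
    {"distance_km": 10, "distance_mi": 6.2},
    {"distance_km": -5, "distance_mi": 6.2},  # violates the non-negativity rule
]
retained = prune(candidates, lambda step, value: value >= 0)
anchors, conflicts = anchors_and_conflicts(retained, tau=1.0)
print(anchors)    # the unanimous distance_km statement
print(conflicts)  # both competing distance_mi statements
```

With a unanimous threshold, only the kilometre value becomes an anchor, while the two competing mile values form the conflict set that later stages would audit.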

2. System structure and closed verification loop

The architecture consists of $N$ expert agents plus a meta-verification agent. Each expert contains a planner that produces a DAG plan over the step set $S$, an actor that executes steps in topological order with ReAct tool calls, a toolkit, and access to the shared TRSF module. The meta-verifier consumes the query together with each expert’s intermediates, confidences, local analyses, and response, then returns the synthesized output $y^{*}$ (Zhang et al., 24 Oct 2025).

Component	Core role	Primary artifact
Expert agents	Propose candidate chains	candidates $c_n$ with intermediates $m_n(s)$
TRSF	Organize, validate, and synchronize evidence	structured facts $\mathcal{F}$
CAMV	Audit disagreements and falsify inconsistent steps	anchors, refutations, repairs
Meta-verifier	Integrate verified evidence into a final answer	final answer $y^{*}$

TRSF supplies structured, provenance-backed facts and anchors to CAMV. CAMV then audits only conflicts among agents’ intermediates using TRSF facts and the constraint set $\mathcal{K}$. Verification outcomes update TRSF by changing fact status to accepted, refuted, or pending; tightening the anchor set $\mathcal{A}$; and shrinking the conflict set $\mathcal{C}$. Iteration stops when conflicts are resolved under the verification budget $B$ or when the residual conflicts are bounded and do not affect the final decision under $\mathcal{K}$.
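A rough sketch of this loop in Python is given below. It assumes a hypothetical `verify` callable standing in for tool-based checks and models only the first stopping rule (conflicts resolved or budget spent); the second rule, that residual conflicts do not affect the final decision, is omitted. The function name and data shapes are illustrative and do not come from the paper.

```python
from typing import Callable


def verification_loop(conflicts: list[str],
                      verify: Callable[[str], bool],
                      budget: int) -> tuple[dict[str, str], list[str]]:
    """Audit conflicts one at a time until none remain or the budget is spent."""
    status = {c: "pending" for c in conflicts}  # every conflict starts unresolved
    remaining = list(conflicts)
    spent = 0
    while remaining and spent < budget:
        fact = remaining.pop(0)  # the conflict set shrinks as items are audited
        status[fact] = "accepted" if verify(fact) else "refuted"
        spent += 1
    return status, remaining


# Hypothetical verifier: only the correct unit conversion passes re-execution.
status, remaining = verification_loop(
    ["10 km = 6.2 mi", "10 km = 16.1 mi", "founded in 2010"],
    verify=lambda fact: fact == "10 km = 6.2 mi",
    budget=2,
)
print(status)
print(remaining)
```

With a budget of two checks, the two conversion claims are settled and the founding-year claim stays pending, which mirrors how unverified conflicts remain visible rather than being silently accepted.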

This organization is designed to alter the computational profile of verification. Rather than re-checking an entire chain, Co-Sight allocates computation to “disagreement hotspots.” The paper states this as a complexity shift from auditing full reasoning traces to auditing only contentious nodes. A plausible implication is that the framework treats verification not as a terminal post-processing pass, but as a dynamic control problem over the intermediate state of multi-agent reasoning.

3. Conflict-Aware Meta-Verification

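CAMV formalizes verification as conflict identification plus targeted falsification. Its conflict graph is defined over the set of structured facts $\mathcal{F}$:

$G = (\mathcal{F}, E), \qquad E = \{ (f_i, f_j) \in \mathcal{F} \times \mathcal{F} : f_i \perp_{\mathcal{K}} f_j \},$

where $f_i \perp_{\mathcal{K}} f_j$ denotes contradiction under $\mathcal{K}$. The corresponding conflict set is

$\mathcal{C} = \{ f \in \mathcal{F} : \exists\, f' \in \mathcal{F},\ (f, f') \in E \}.$

Selective verification is then performed only over $\mathcal{C}$ or the corresponding step set $S_{\mathcal{C}} \subseteq S$, giving the cost relation

$O(|S_{\mathcal{C}}|) \ \text{rather than} \ O(|S|), \qquad |S_{\mathcal{C}}| \le |S|.$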

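Hotspots are prioritized by a conflict score, written here as $\operatorname{score}(c)$ for a conflict $c \in \mathcal{C}$. Given a verification budget $B$, the verifier selects the $B$ highest-scoring conflicts,

$\mathcal{C}_{B} = \operatorname{Top}_{B}\big(\mathcal{C};\ \operatorname{score}\big),$

and allocates tests in descending rank until the budget is exhausted.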
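Budgeted hotspot selection can be sketched in a few lines of Python. The scores here are assumed to be precomputed by some conflict-scoring function, and both the conflict identifiers and the numeric values are hypothetical; only the top-$B$ selection in descending order reflects the description above.

```python
import heapq


def select_hotspots(scores: dict[str, float], budget: int) -> list[str]:
    """Return at most `budget` conflict ids, highest score first."""
    return heapq.nlargest(budget, scores, key=scores.get)


# Hypothetical conflict scores (higher means more contentious).
scores = {"currency": 0.2, "unit_conversion": 0.9, "founding_year": 0.6}
hotspots = select_hotspots(scores, budget=2)
print(hotspots)
```

Using `heapq.nlargest` avoids sorting the whole conflict set when the budget is much smaller than the number of conflicts, which matches the intent of spending verification effort only on the top-ranked disagreements.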

CAMV operates through a four-stage pipeline. First, Constraint-Based Pruning excises intermediates violating $\mathcal{K}$ and prunes dependent derivations. If no valid candidates remain, it backtracks via Elimination-by-Aspects (EBA), locating violated constraints, removing dependents, and repairing only affected sub-chains. Second, Consensus Anchoring promotes recurring intermediates to the anchor set $\mathcal{A}$ so that settled premises are not repeatedly audited. Third, Conflict Auditing ranks the conflict set $\mathcal{C}$ and runs targeted tests to support or refute each hotspot, updating only the affected nodes. Fourth, Integrative Synthesis recombines valid micro-inferences from the retained candidates under $\mathcal{K}$ and the fact base $\mathcal{F}$ to produce the final answer and confidence.

The paper also emphasizes the role of diversified expert behavior. Conservative agents reinforce anchors, while radical agents expand the conflict set $\mathcal{C}$ and thereby expose low-probability but potentially correct paths. This makes disagreement itself an informative signal rather than a pathology to be suppressed.

4. Trustworthy Reasoning with Structured Facts

TRSF is the evidence substrate of Co-Sight. It continuously organizes, validates, and synchronizes given facts, retrieved facts, derived facts, and assumptions through a three-tier context compression pipeline: tool level, notes level, and facts level (Zhang et al., 24 Oct 2025). Tool level stores minimal metadata such as tool identity, parameters, and outcomes. Notes level stores concise trajectory annotations and credibility judgments. Facts level stores stable, verified knowledge promoted into the shared facts module.

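A practical fact schema is specified as

$f = (\text{id},\ \text{content},\ \text{source},\ \text{timestamp},\ \text{agent},\ \sigma,\ \text{status},\ \text{deps}).$

Each fact thus carries provenance, timestamp, agent identity, confidence, and dependencies. TRSF imposes a no-co-acceptance consistency constraint over contradictory facts:

$f_i \perp_{\mathcal{K}} f_j \ \Longrightarrow\ \neg\big( \operatorname{status}(f_i) = \text{accepted} \ \wedge\ \operatorname{status}(f_j) = \text{accepted} \big) \quad \text{for all } f_i, f_j \in \mathcal{F}.$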
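The schema and the no-co-acceptance rule can be made concrete with a small Python sketch. The field names follow the attributes listed above, but the `Fact` class, the `key = value` content convention, the example timestamps, and the `contradicts` predicate are assumptions for illustration rather than the paper's data model.

```python
from dataclasses import dataclass, replace
from itertools import combinations
from typing import Callable


@dataclass(frozen=True)
class Fact:
    id: str
    content: str               # assumed "key = value" form for this sketch
    source: str                # provenance, e.g. a URL or tool log reference
    timestamp: str
    agent: str
    confidence: float
    status: str = "pending"    # "accepted", "refuted", or "pending"
    deps: tuple[str, ...] = ()


def contradicts(a: Fact, b: Fact) -> bool:
    """Assumed contradiction test: same key, different value."""
    key_a, _, val_a = a.content.partition("=")
    key_b, _, val_b = b.content.partition("=")
    return key_a.strip() == key_b.strip() and val_a.strip() != val_b.strip()


def co_acceptance_violations(facts: list[Fact],
                             test: Callable[[Fact, Fact], bool]) -> list[tuple[str, str]]:
    """List pairs of contradictory facts that are both marked accepted."""
    return [(a.id, b.id) for a, b in combinations(facts, 2)
            if a.status == "accepted" and b.status == "accepted" and test(a, b)]


f1 = Fact("f1", "founded = 2010", "assumption", "2025-10-24T00:00:00Z",
          "expert_1", 0.5, "accepted")
f2 = Fact("f2", "founded = 2011", "official registry", "2025-10-24T00:05:00Z",
          "expert_2", 0.9, "accepted")

before = co_acceptance_violations([f1, f2], contradicts)
# Demoting the assumption restores consistency.
after = co_acceptance_violations([replace(f1, status="refuted"), f2], contradicts)
print(before, after)
```

The check reports the conflicting pair while both facts are accepted and reports nothing once the weaker assumption is demoted, which is the behavior the constraint is meant to enforce.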

The update cycle comprises fact extraction, source attribution, validation, and synchronization. Candidate facts are extracted from tool outputs and reasoning notes; URLs, tool configurations, logs, and hashed artifacts are attached; validation is performed using constraint checks, cross-tool re-execution, and cross-source corroboration; and agent-local fact sets $\mathcal{F}_n$ are merged into the shared store $\mathcal{F}$ via

$\mathcal{F} \leftarrow \operatorname{Merge}\Big( \mathcal{F},\ \bigcup_{n=1}^{N} \mathcal{F}_n \Big),$

where $\operatorname{Merge}$ deduplicates by id or content hash, resolves contradictions according to Eq. (6), and prefers accepted facts with stronger provenance and higher confidence $\sigma$.
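A minimal merge sketch in Python follows. It covers only deduplication by content hash and the preference for accepted, higher-confidence facts; contradiction resolution and provenance strength are left out. The reduced `Fact` fields, the normalization by lowercasing and stripping whitespace, and the ranking rule are assumptions for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    id: str
    content: str
    confidence: float
    status: str = "pending"


def content_key(fact: Fact) -> str:
    """Hash of the normalized content, used to detect duplicates."""
    normalized = fact.content.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def rank(fact: Fact) -> tuple[bool, float]:
    """Accepted facts win first; ties are broken by confidence."""
    return (fact.status == "accepted", fact.confidence)


def merge(shared: list[Fact], *local_sets: list[Fact]) -> list[Fact]:
    """Merge agent-local fact sets into the shared store, one fact per content."""
    best: dict[str, Fact] = {}
    for fact in [*shared, *(f for local in local_sets for f in local)]:
        key = content_key(fact)
        if key not in best or rank(fact) > rank(best[key]):
            best[key] = fact
    return list(best.values())


shared = [Fact("f1", "distance = 10 km", 0.7)]
agent_a = [Fact("a1", "Distance = 10 km", 0.9, "accepted")]
agent_b = [Fact("b1", "founded = 2011", 0.8, "accepted"),
           Fact("b2", "distance = 10 km", 0.95)]

merged = merge(shared, agent_a, agent_b)
print([f.id for f in merged])
```

In this example the accepted copy of the distance fact is kept even though a pending copy has higher confidence, reflecting the stated preference for accepted facts before confidence is compared.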

TRSF is therefore not just a memory mechanism in the usual retrieval-augmented sense. It is a provenance-preserving reconciliation layer tuned for verification. The framework’s case studies illustrate this concretely. In a unit-mismatch example, tool outputs containing “distance = 10 km” are recorded at the tool level, annotated at the notes level, then promoted as a fact with source and time metadata. CAMV subsequently audits only the conversion step where experts disagree, re-executes the conversion, refutes the miles-based variant, and updates the anchor set. In another example, an assumption that a company was founded in 2010 is explicitly demoted after a retrieved fact from an official registry supports 2011.

5. Evaluation, ablations, and empirical behavior

Co-Sight is evaluated on three benchmarks with different reasoning profiles. On GAIA, a benchmark of 300 real-world tool-augmented questions spanning three difficulty levels, it achieves 84.4% overall accuracy, with 95.7% on Level-1, 83.0% on Level-2, and 67.3% on Level-3. On Humanity’s Last Exam, it achieves 35.5%, and on Chinese-SimpleQA it achieves 93.8% with an ensemble of $N$ experts (Zhang et al., 24 Oct 2025).

The paper attributes the GAIA gains on easier and moderate tasks to domain-gated credibility ranking and three-tier compression, which reduce context noise and allow premises to be verified without re-exploring full alternative chains. On more difficult GAIA tasks and on HLE, the reported advantage comes from conflict-aware auditing of contentious nodes combined with TRSF cross-checking through external tools such as code execution and literature retrieval.

The ablation study on Chinese-SimpleQA isolates the contributions of individual mechanisms. With a fixed ensemble of $N$ experts, the reported scores (accuracy, %) are:

Baseline: 82.6
SV (single-step verification only): 85.7
TRSF only: 85.0
CAMV (includes SV): 88.8
SV + TRSF: 87.7
Co-Sight (CAMV + TRSF): 91.2

The ensemble-size analysis further reports that, for small ensemble sizes $N$, CAMV rises from 88.3% to 91.2%, surpassing pass@$N$ because it recombines micro-inferences and audits within the fixed budget $B$. For larger $N$, pass@$N$ can exceed CAMV because a fixed budget must cover a growing conflict set $\mathcal{C}$, yet the full system still reaches 93.8%.

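These results are presented as evidence for a synergy claim: CAMV is more compute-efficient than full-path verification because it audits only disagreements, while TRSF strengthens grounding and reduces contextual noise. The ablations are consistent with this claim: over the 82.6 baseline, CAMV alone adds 6.2 points, TRSF alone adds 2.4 points, and the combined system adds 8.6 points, so the two gains are complementary and roughly additive.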

6. Relation to adjacent approaches, limitations, and terminological scope

Co-Sight is positioned against several existing research lines. Relative to chain-of-thought and self-consistency, it does not merely broaden sampling or vote over final answers; it explicitly treats disagreements as audit targets. Relative to debate frameworks, it reconceptualizes debate as conflict detection under provenance-backed fact constraints. Relative to verifier models and post-hoc checkers such as sentence-level verification systems, it triages verification before spending compute. Relative to graph-of-thought and plan-and-act scaffolds, it adds a fine-grained verification loop without requiring a new planner. Relative to GraphRAG and LongRAG-style systems, TRSF adds continuous provenance management, consistency constraints, and synchronization targeted at verification rather than only retrieval organization (Zhang et al., 24 Oct 2025).

The framework also has stated limitations. If expert plans are incomplete, important errors may never appear in the conflict set $\mathcal{C}$, reducing audit coverage. Multimodal verification remains bounded by vision and parsing accuracy. The paper further notes that, despite strong GAIA and HLE results, real-world deployment in safety-critical settings requires additional validation and domain-specific governance. Hyperparameter guidance is correspondingly conservative: a fixed anchoring threshold $\tau$ is recommended for small ensembles, and the verification budget $B$ is described as a small fixed integer, for example 5–20 verifications per query, depending on latency goals.

In broader arXiv usage, the expression “Co-Sight” is not unique to LLM agents. In co-salient object detection, it has also been used to describe collective visual consensus across image groups: the ability to identify the common salient object shared across related images while suppressing distractors (Zheng et al., 2022). Earlier co-saliency work similarly framed the task as jointly balancing intra-image saliency and inter-image correspondence to reveal shared salient objects despite group-level variability (Jeong et al., 2017). These usages are conceptually related only at a high level: both invoke consensus formation across multiple signals, but the agentic Co-Sight framework is specifically a conflict-aware verification architecture for long-horizon reasoning.

Taken together, Co-Sight defines a particular model of trustworthy agent behavior: reasoning is decomposed into candidate traces, stable premises are elevated into auditable anchors, contradictions are localized into a conflict set, and verification effort is concentrated where disagreement is highest. Its distinctive contribution lies less in any single module than in the closed coupling of structured factual grounding with selective falsification.