
Co-Sight: Auditable Long-Horizon Reasoning

Updated 4 July 2026
  • Co-Sight is a framework for long-horizon reasoning in LLM agents that redefines inference by emphasizing falsification of intermediate steps.
  • It employs two key mechanisms—Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF)—to audit disagreements and secure verifiable evidence.
  • The approach optimizes computational resources by focusing verification on contentious reasoning steps, thereby enhancing accuracy and transparency.

Co-Sight is a framework for long-horizon reasoning in LLM-based agents that recasts inference as a falsifiable and auditable process rather than a purely generative one. Its central claim is that many failures in agentic reasoning arise because intermediate steps are insufficiently verified, provenance is lost, and tool traces become entangled with assumptions. The framework addresses this with two coupled mechanisms, Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF), organized as a closed verification loop in which expert agents propose candidate reasoning chains, a shared facts module maintains verifiable anchors, and a meta-verifier selectively tests only the conflicts that matter (Zhang et al., 24 Oct 2025).

1. Conceptual basis and problem formulation

Co-Sight begins from a specific diagnosis of long-horizon agent failure: errors propagate from early substeps, tool outputs and assumptions are mixed, and final answers become difficult to audit. In response, it shifts the operative objective from producing a single answer to proposing hypotheses and falsifying them. This suggests a methodological inversion relative to standard chain-of-thought systems: the critical resource is not unrestricted reasoning depth, but selective verification depth targeted at inconsistencies.

The framework is instantiated in a multi-agent setting. Let $N$ expert agents produce candidates $c_n$ with intermediate results and confidences. Reasoning steps form a DAG plan with step set $S$. For candidate $n$ and step $s \in S$, the intermediate result is $m_n(s)$, the local analysis is $h(m_n(s))$, and the calibrated confidence is $\sigma(m_n(s)) \in [0,1]$. Collecting all intermediates and confidences yields

$$M = \{ m_n(s) : n \in [N],\ s \in S \}, \qquad \Gamma = \{ \sigma(m) : m \in M \}.$$
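As a minimal sketch of this bookkeeping, assuming a plain in-memory representation, the sets M and Γ can be collected from per-agent, per-step records. The class name `Intermediate`, its field names, and the helper `collect` are illustrative assumptions, not names from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intermediate:
    """One intermediate result m_n(s). Field names are hypothetical."""
    agent: int         # n, index of the expert agent
    step: str          # s, a step in the DAG plan
    value: str         # the intermediate result itself
    analysis: str      # local analysis h(m_n(s))
    confidence: float  # calibrated confidence sigma(m_n(s)), expected in [0, 1]


def collect(records):
    """Return (M, Gamma): all intermediates and their confidences."""
    M = list(records)
    for m in M:
        if not 0.0 <= m.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {m}")
    # Gamma is keyed by (agent, step) so each confidence stays traceable.
    Gamma = {(m.agent, m.step): m.confidence for m in M}
    return M, Gamma


records = [
    Intermediate(0, "s1", "10 km", "read from tool output", 0.9),
    Intermediate(1, "s1", "10 km", "read from tool output", 0.8),
    Intermediate(1, "s2", "6.2 mi", "converted km to miles", 0.6),
]
M, Gamma = collect(records)
```

Keying Γ by (agent, step) is a design choice of this sketch; it keeps every confidence attached to the intermediate it describes.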

Co-Sight also assumes a constraint set $\mathcal{K}$, containing schemas, units, invariants, and consistency rules. Any result consistent with $\mathcal{K}$ belongs to

cnc_n1

After constraint pruning, the retained candidates are

cnc_n2
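The pruning step can be illustrated with a short sketch. It assumes that the constraint set is represented as a list of boolean predicates and that a candidate is retained only if every one of its intermediates passes every predicate; both are simplifying assumptions, and the function names `prune`, `unit_ok`, and `non_negative` are hypothetical.

```python
def prune(candidates, constraints):
    """Keep candidates whose every intermediate satisfies every constraint.

    candidates:  dict agent -> dict step -> (value, unit)
    constraints: list of predicates (step, result) -> bool, standing in for K
    """
    def valid(step, result):
        return all(check(step, result) for check in constraints)

    return {
        agent: steps
        for agent, steps in candidates.items()
        if all(valid(step, result) for step, result in steps.items())
    }


# Two toy constraints: a unit schema and a non-negativity invariant.
def unit_ok(step, result):
    _value, unit = result
    return unit in {"km", "mi"}


def non_negative(step, result):
    value, _unit = result
    return value >= 0


candidates = {
    0: {"distance": (10.0, "km")},
    1: {"distance": (10.0, "parsecs")},  # violates the unit schema
    2: {"distance": (-3.0, "km")},       # violates the invariant
}
retained = prune(candidates, [unit_ok, non_negative])
```

Dropping a whole candidate is coarser than removing only the offending intermediates and their dependents; the sketch uses the coarser rule to stay short.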

A central construct is the distinction between consensus anchors and conflicts. Recurring, cross-agent intermediates are promoted to anchors with threshold cnc_n3:

cnc_n4

Here cnc_n5 is the set of statements extracted from retained traces. Disagreement points define the conflict set

cnc_n6
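A rough sketch of the anchor and conflict split follows. It assumes that statements are keyed by plan step, that agreement is measured as the fraction of retained agents reporting the same value, and that any step with more than one distinct value is a conflict. These are assumptions of the sketch, as is the function name.

```python
from collections import defaultdict


def anchors_and_conflicts(retained, threshold):
    """Split steps into consensus anchors and conflicts.

    retained:  dict agent -> dict step -> value
    threshold: minimum fraction of agents that must agree (assumed form)
    """
    votes = defaultdict(lambda: defaultdict(set))
    for agent, steps in retained.items():
        for step, value in steps.items():
            votes[step][value].add(agent)

    n_agents = len(retained)
    anchors, conflicts = {}, {}
    for step, by_value in votes.items():
        if len(by_value) > 1:
            # Agents disagree: record who supports which value.
            conflicts[step] = {v: sorted(a) for v, a in by_value.items()}
            continue
        (value, agents), = by_value.items()
        if len(agents) / n_agents >= threshold:
            anchors[step] = value
    return anchors, conflicts


retained = {
    0: {"distance_km": 10, "distance_mi": 6.2},
    1: {"distance_km": 10, "distance_mi": 16.1},
    2: {"distance_km": 10, "distance_mi": 6.2},
}
anchors, conflicts = anchors_and_conflicts(retained, threshold=0.5)
```

Here the kilometre reading becomes an anchor because all three agents agree, while the miles conversion is flagged as the only conflict.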

The final answer is then synthesized as a coherent result in cnc_n7 maximizing an integrated score over anchors, conflicts, confidences, and analyses:

cnc_n8

2. System structure and closed verification loop

The architecture consists of $N$ expert agents plus a meta-verification agent. Each expert contains a planner that produces a DAG SS0, an actor that executes steps in topological order with ReAct tool calls, a toolkit, and access to the shared TRSF module. The meta-verifier consumes the query together with each expert’s intermediates, confidences, local analyses, and response, then returns the synthesized output SS1 (Zhang et al., 24 Oct 2025).

Component      | Core role                                           | Primary artifact
---------------+-----------------------------------------------------+------------------------------
Expert agents  | Propose candidate chains                            | SS2
TRSF           | Organize, validate, and synchronize evidence        | structured facts SS3
CAMV           | Audit disagreements and falsify inconsistent steps  | anchors, refutations, repairs
Meta-verifier  | Integrate verified evidence into a final answer     | SS4

TRSF supplies structured, provenance-backed facts and anchors to CAMV. CAMV then audits only conflicts among agents’ intermediates using TRSF facts and the constraint set $\mathcal{K}$. Verification outcomes update TRSF by changing fact status to accepted, refuted, or pending; tightening anchors SS6; and shrinking the conflict set SS7. Iteration stops when conflicts are resolved under the verification budget SS8 or when the residual conflicts are bounded and do not affect the final decision under SS9.
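One pass of this loop can be sketched as follows, under stated assumptions: the budget is counted in calls to a verification function, each competing value is tested independently, and a conflict counts as resolved once at most one value survives and none is left untested. The function `audit_conflicts` and the `verify` callback are hypothetical stand-ins, not the paper's interfaces.

```python
def audit_conflicts(conflicts, verify, budget):
    """Audit competing values until the budget of verify calls is spent.

    conflicts: dict step -> list of competing values
    verify:    callable (step, value) -> bool, a stand-in falsification test
    budget:    maximum number of verify calls
    Returns (status, remaining, used).
    """
    status = {}      # (step, value) -> "accepted" | "refuted" | "pending"
    remaining = {}   # conflicts still open after this pass
    used = 0
    for step, values in conflicts.items():
        survivors = []
        for value in values:
            if used >= budget:
                # Out of budget: leave the value untested.
                status[(step, value)] = "pending"
                survivors.append(value)
                continue
            used += 1
            ok = verify(step, value)
            status[(step, value)] = "accepted" if ok else "refuted"
            if ok:
                survivors.append(value)
        pending = any(status[(step, v)] == "pending" for v in values)
        if pending or len(survivors) > 1:
            remaining[step] = survivors
    return status, remaining, used


conflicts = {"distance_mi": [6.2, 16.1], "founded": [2010, 2011]}
truth = {"distance_mi": 6.2, "founded": 2011}   # toy oracle for the example
status, remaining, used = audit_conflicts(
    conflicts, lambda s, v: truth[s] == v, budget=3
)
```

With a budget of three calls, the conversion conflict is settled and the founding-year conflict stays open with one untested value, which is the situation the stopping rule has to judge.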

This organization is designed to alter the computational profile of verification. Rather than re-checking an entire chain, Co-Sight allocates computation to “disagreement hotspots.” The paper states this as a complexity shift from auditing full reasoning traces to auditing only contentious nodes. A plausible implication is that the framework treats verification not as a terminal post-processing pass, but as a dynamic control problem over the intermediate state of multi-agent reasoning.

3. Conflict-Aware Meta-Verification

CAMV formalizes verification as conflict identification plus targeted falsification. Its conflict graph is defined over the set of structured facts nn0:

nn1

where nn2 denotes contradiction under nn3. The corresponding conflict set is

nn4

Selective verification is then performed only over nn5 or the corresponding step set nn6, giving the cost relation

nn7

Hotspots are prioritized by a conflict score. A typical instantiation is

nn8

The verifier selects

nn9

and allocates tests in descending rank until the budget is exhausted.
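Budgeted hotspot selection can be sketched as a top-k ranking. The scoring formula below is an assumption made purely for illustration (more competing values, lower confidence, and more downstream dependents raise the score); it is not the paper's instantiation, and the names `conflict_score` and `select_hotspots` are hypothetical.

```python
import heapq


def conflict_score(n_values, mean_confidence, n_dependents):
    """Hypothetical score: disagreement x uncertainty x downstream impact."""
    return (n_values - 1) * (1.0 - mean_confidence) * (1 + n_dependents)


def select_hotspots(hotspots, budget):
    """Return up to `budget` hotspot ids in descending score order.

    hotspots: dict step -> (n_values, mean_confidence, n_dependents)
    """
    scored = [(conflict_score(*features), step)
              for step, features in hotspots.items()]
    return [step for _score, step in heapq.nlargest(budget, scored)]


hotspots = {
    "unit_conversion": (2, 0.6, 3),
    "founding_year": (2, 0.8, 0),
    "author_name": (3, 0.9, 1),
}
chosen = select_hotspots(hotspots, budget=2)
```

Whatever score is used, the selection itself only needs a partial sort, so `heapq.nlargest` avoids ranking hotspots that will never be tested.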

CAMV operates through a four-stage pipeline. First, Constraint-Based Pruning excises intermediates violating $\mathcal{K}$ and prunes dependent derivations. If no valid candidates remain, it backtracks via Elimination-by-Aspects (EBA), locating violated constraints, removing dependents, and repairing only affected sub-chains. Second, Consensus Anchoring promotes recurring intermediates to sSs \in S1 so that settled premises are not repeatedly audited. Third, Conflict Auditing ranks sSs \in S2 and runs sSs \in S3 to support or refute each hotspot, updating only the affected nodes. Fourth, Integrative Synthesis recombines valid micro-inferences from sSs \in S4 under sSs \in S5 and the fact base sSs \in S6 to produce the final answer and confidence.

The paper also emphasizes the role of diversified expert behavior. Conservative agents reinforce anchors, while radical agents expand sSs \in S7 and thereby expose low-probability but potentially correct paths. This makes disagreement itself an informative signal rather than a pathology to be suppressed.

4. Trustworthy Reasoning with Structured Facts

TRSF is the evidence substrate of Co-Sight. It continuously organizes, validates, and synchronizes given facts, retrieved facts, derived facts, and assumptions through a three-tier context compression pipeline: tool level, notes level, and facts level (Zhang et al., 24 Oct 2025). Tool level stores minimal metadata such as tool identity, parameters, and outcomes. Notes level stores concise trajectory annotations and credibility judgments. Facts level stores stable, verified knowledge promoted into the shared facts module.
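The three tiers can be pictured with a small sketch. It assumes that the tool level keeps only the first line of a raw output, that a note carries a boolean credibility judgment, and that only credible notes are promoted to facts. All function names and record layouts here are illustrative assumptions.

```python
def tool_level(tool, params, raw_output, max_chars=60):
    """Tool level: keep tool identity, parameters, and a short outcome."""
    lines = raw_output.strip().splitlines()
    outcome = lines[0][:max_chars] if lines else ""
    return {"tool": tool, "params": params, "outcome": outcome}


def notes_level(record, annotation, credible):
    """Notes level: a concise annotation plus a credibility judgment."""
    return {
        "tool": record["tool"],
        "outcome": record["outcome"],
        "note": annotation,
        "credible": credible,
    }


def facts_level(notes):
    """Facts level: promote only outcomes judged credible."""
    return [n["outcome"] for n in notes if n["credible"]]


raw = "distance = 10 km\n(debug) 37 intermediate rows omitted\n"
rec = tool_level("route_planner", {"from": "A", "to": "B"}, raw)
good = notes_level(rec, "matches the map tool", credible=True)
bad = notes_level(
    tool_level("web_search", {"q": "A to B"}, "distance = 10 miles"),
    "unsourced forum post",
    credible=False,
)
facts = facts_level([good, bad])
```

Each tier discards detail the next tier does not need, so the shared store only ever sees short, vetted statements.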

A practical fact schema is specified as

sSs \in S8

Each fact thus carries provenance, timestamp, agent identity, confidence, and dependencies. TRSF imposes a no-co-acceptance consistency constraint over contradictory facts:

sSs \in S9
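A fact record and the no-co-acceptance check might look like the sketch below. The field names are assumptions chosen to mirror the attributes listed in the text (provenance, timestamp, agent, confidence, dependencies, status), and the `contradicts` predicate is a toy rule supplied by the caller rather than anything specified by the framework.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Fact:
    """Hypothetical fact record; field names are illustrative."""
    id: str
    statement: str
    kind: str                 # "given" | "retrieved" | "derived" | "assumption"
    source: str               # provenance: tool name, log, or document reference
    timestamp: str            # ISO 8601 string
    agent: str
    confidence: float
    status: str = "pending"   # "accepted" | "refuted" | "pending"
    depends_on: list = field(default_factory=list)


def co_acceptance_violations(facts, contradicts):
    """Return id pairs of contradictory facts that are both accepted."""
    return [
        (a.id, b.id)
        for a, b in combinations(facts, 2)
        if a.status == b.status == "accepted" and contradicts(a, b)
    ]


def same_key_different_value(a, b):
    """Toy contradiction rule for 'key=value' statements."""
    return (a.statement.split("=")[0] == b.statement.split("=")[0]
            and a.statement != b.statement)


f1 = Fact("f1", "founded_year=2010", "assumption", "agent guess",
          "2025-10-24T00:00:00Z", "expert-1", 0.5, status="accepted")
f2 = Fact("f2", "founded_year=2011", "retrieved", "official registry",
          "2025-10-24T00:05:00Z", "expert-2", 0.9, status="accepted")
before = co_acceptance_violations([f1, f2], same_key_different_value)
f1.status = "refuted"   # demote the assumption once the registry fact is accepted
after = co_acceptance_violations([f1, f2], same_key_different_value)
```

The check only reports violations; deciding which fact to demote is left to the verifier, as in the founding-year example where the assumption gives way to the registry record.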

The update cycle comprises fact extraction, source attribution, validation, and synchronization. Candidate facts are extracted from tool outputs and reasoning notes; URLs, tool configurations, logs, and hashed artifacts are attached; validation is performed using constraint checks, cross-tool re-execution, and cross-source corroboration; and agent-local fact sets are merged into the shared store via

mn(s)m_n(s)0

where mn(s)m_n(s)1 deduplicates by id or content hash, resolves contradictions according to Eq. (6), and prefers accepted facts with stronger provenance and higher mn(s)m_n(s)2.
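The deduplicating merge can be approximated as below. This sketch assumes facts are plain dictionaries, keys a fact by its id when present and otherwise by a SHA-256 hash of its statement, and prefers accepted status first and higher confidence second. Contradiction resolution and provenance strength are left out, and the confidence tie-break is an assumption.

```python
import hashlib

# Assumed preference order between statuses.
RANK = {"accepted": 2, "pending": 1, "refuted": 0}


def content_key(fact):
    """Use the id when present, otherwise a hash of the statement."""
    if fact.get("id"):
        return fact["id"]
    return hashlib.sha256(fact["statement"].encode("utf-8")).hexdigest()


def merge(shared, local):
    """Merge agent-local facts into the shared store without duplicates."""
    merged = {content_key(f): f for f in shared}
    for f in local:
        key = content_key(f)
        old = merged.get(key)
        new_rank = (RANK[f["status"]], f["confidence"])
        if old is None or new_rank > (RANK[old["status"]], old["confidence"]):
            merged[key] = f
    return list(merged.values())


shared = [
    {"id": "f1", "statement": "distance = 10 km",
     "status": "pending", "confidence": 0.7},
]
local = [
    {"id": "f1", "statement": "distance = 10 km",
     "status": "accepted", "confidence": 0.9},
    {"statement": "founded in 2011",
     "status": "accepted", "confidence": 0.8},
]
store = merge(shared, local)
```

Because the merge is keyed, applying the same local set twice leaves the store unchanged, which is a convenient property when several agents synchronize repeatedly.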

TRSF is therefore not just a memory mechanism in the usual retrieval-augmented sense. It is a provenance-preserving reconciliation layer tuned for verification. The framework’s case studies illustrate this concretely. In a unit-mismatch example, tool outputs containing “distance = 10 km” are recorded at the tool level, annotated at the notes level, then promoted as a fact with source and time metadata. CAMV subsequently audits only the conversion step where experts disagree, re-executes the conversion, refutes the miles-based variant, and updates the anchor set. In another example, an assumption that a company was founded in 2010 is explicitly demoted after a retrieved fact from an official registry supports 2011.

5. Evaluation, ablations, and empirical behavior

Co-Sight is evaluated on three benchmarks with different reasoning profiles. On GAIA, a benchmark of 300 real-world tool-augmented questions spanning three difficulty levels, it achieves 84.4% overall accuracy, with 95.7% on Level-1, 83.0% on Level-2, and 67.3% on Level-3. On Humanity’s Last Exam, it achieves 35.5%, and on Chinese-SimpleQA it achieves 93.8% with mn(s)m_n(s)3 experts (Zhang et al., 24 Oct 2025).

The paper attributes the GAIA gains on easier and moderate tasks to domain-gated credibility ranking and three-tier compression, which reduce context noise and allow premises to be verified without re-exploring full alternative chains. On more difficult GAIA tasks and on HLE, the reported advantage comes from conflict-aware auditing of contentious nodes combined with TRSF cross-checking through external tools such as code execution and literature retrieval.

The ablation study on Chinese-SimpleQA isolates the contributions of individual mechanisms. With mn(s)m_n(s)4 experts, the reported scores are:

  • Baseline: 82.6
  • SV (single-step verification only): 85.7
  • TRSF only: 85.0
  • CAMV (includes SV): 88.8
  • SV + TRSF: 87.7
  • Co-Sight (CAMV + TRSF): 91.2
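The gains over the baseline implied by these scores can be computed directly. The snippet only performs arithmetic on the numbers listed above; the variable names are illustrative.

```python
scores = {
    "Baseline": 82.6,
    "SV": 85.7,
    "TRSF only": 85.0,
    "CAMV (includes SV)": 88.8,
    "SV + TRSF": 87.7,
    "Co-Sight (CAMV + TRSF)": 91.2,
}

baseline = scores["Baseline"]
# Gain of each configuration over the baseline, in points.
gains = {
    name: round(score - baseline, 1)
    for name, score in scores.items()
    if name != "Baseline"
}

# Compare the combined gain with the sum of the two single-mechanism gains.
sum_of_parts = round(gains["CAMV (includes SV)"] + gains["TRSF only"], 1)
combined = gains["Co-Sight (CAMV + TRSF)"]
```

On these figures CAMV alone adds 6.2 points, TRSF alone adds 2.4, and the full system adds 8.6, so the combined gain equals the sum of the two individual gains.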

The ensemble-size analysis further reports that with mn(s)m_n(s)5, CAMV rises from 88.3% to 91.2%, surpassing pass@mn(s)m_n(s)6 for small ensembles because it recombines micro-inferences and audits within the fixed budget mn(s)m_n(s)7. For mn(s)m_n(s)8, pass@mn(s)m_n(s)9 can exceed CAMV because a fixed budget must cover a growing h(mn(s))h(m_n(s))0, yet the full system still reaches 93.8% at h(mn(s))h(m_n(s))1.

These results are presented as evidence for a synergy claim: CAMV is more compute-efficient than full-path verification because it audits only disagreements, while TRSF strengthens grounding and reduces contextual noise. The ablation pattern is consistent with that claim: CAMV and TRSF each improve on the baseline when used alone, and the combined system scores highest.

6. Relation to adjacent approaches, limitations, and terminological scope

Co-Sight is positioned against several existing research lines. Relative to chain-of-thought and self-consistency, it does not merely broaden sampling or vote over final answers; it explicitly treats disagreements as audit targets. Relative to debate frameworks, it reconceptualizes debate as conflict detection under provenance-backed fact constraints. Relative to verifier models and post-hoc checkers such as sentence-level verification systems, it triages verification before spending compute. Relative to graph-of-thought and plan-and-act scaffolds, it adds a fine-grained verification loop without requiring a new planner. Relative to GraphRAG and LongRAG-style systems, TRSF adds continuous provenance management, consistency constraints, and synchronization targeted at verification rather than only retrieval organization (Zhang et al., 24 Oct 2025).

The framework also has stated limitations. If expert plans are incomplete, important errors may never appear in h(mn(s))h(m_n(s))2, reducing audit coverage. Multimodal verification remains bounded by vision and parsing accuracy. The paper further notes that, despite strong GAIA and HLE results, real-world deployment in safety-critical settings requires additional validation and domain-specific governance. Hyperparameter guidance is correspondingly conservative: h(mn(s))h(m_n(s))3 is recommended for small ensembles h(mn(s))h(m_n(s))4, and h(mn(s))h(m_n(s))5 is described as a small fixed integer, for example 5–20 verifications per query, depending on latency goals.

In broader arXiv usage, the expression “Co-Sight” is not unique to LLM agents. In co-salient object detection, it has also been used to describe collective visual consensus across image groups: the ability to identify the common salient object shared across related images while suppressing distractors (Zheng et al., 2022). Earlier co-saliency work similarly framed the task as jointly balancing intra-image saliency and inter-image correspondence to reveal shared salient objects despite group-level variability (Jeong et al., 2017). These usages are conceptually related only at a high level: both invoke consensus formation across multiple signals, but the agentic Co-Sight framework is specifically a conflict-aware verification architecture for long-horizon reasoning.

Taken together, Co-Sight defines a particular model of trustworthy agent behavior: reasoning is decomposed into candidate traces, stable premises are elevated into auditable anchors, contradictions are localized into a conflict set, and verification effort is concentrated where disagreement is highest. Its distinctive contribution lies less in any single module than in the closed coupling of structured factual grounding with selective falsification.
