
PG-CoT: Perception-Grounded Chain-of-Thought

Updated 26 December 2025
  • PG-CoT is a paradigm that integrates explicit perceptual cues from images, audio, or sensors into each reasoning step.
  • It enhances traditional chain-of-thought models by reducing hallucinations and improving traceability through grounded evidence.
  • The approach has demonstrated significant gains in tasks like social reasoning, geospatial analytics, and 3D scene understanding across multiple benchmarks.

Perception-Grounded Chain-of-Thought (PG-CoT) refers to a family of prompting and modeling strategies for vision (and broader multimodal) LLMs in which intermediate reasoning steps are explicitly anchored in perceptual evidence extracted from images, audio, or other sensory inputs. PG-CoT stands in contrast to traditional Chain-of-Thought (CoT) approaches that primarily operate on language alone, often resulting in brittle explanations or hallucinated outputs. By enforcing a protocol where every inference—whether in social reasoning, spatial navigation, or analytic scene understanding—must be directly grounded in the raw perceptual signal, PG-CoT enhances model interpretability, generalization, and trustworthiness across modalities and tasks.

1. Fundamental Concepts and Instantiations

PG-CoT is characterized by its insistence that intermediate rationales are traceable back to perceptual evidence at each reasoning stage. In most implementations, the model first extracts or describes what is directly seen, heard, or otherwise sensed (the "Perception" stage), and only then transitions to higher-level inference, relational reasoning, or norm-based judgment. Specific instantiations include:

  • Cognitive Chain-of-Thought (CoCoT): A three-stage protocol for visually grounded social reasoning in VLMs, comprising Perception, Situation, and Norm inference. The Perception step requires explicit enumeration of observable facts, preventing premature reliance on linguistic priors (Park et al., 27 Jul 2025).
  • Geo-CoT (PG-CoT for Geospatial Analytics): A three-phase workflow of Planning, Grounding, and Synthesis, utilized in RSThinker for remote sensing. Every inference step includes pixel-level references, bounding boxes, or region coordinates, ensuring factual auditability (Liu et al., 26 Sep 2025).
  • Grounding CoT for Maze Reasoning: In vision-centric maze-solving, each CoT step carries a minimal descriptor (e.g., spatial coordinate pairs), which is linearized into the token stream, deliberately minimizing linguistic flourish in favor of concise, grounded actions (Du et al., 27 Nov 2025).
  • Audio-CoT: Extends the paradigm to audio inputs in LALMs, with explicit chains anchored in the audio embedding, description, or segment-level evidence (Ma et al., 13 Jan 2025).

All these paradigms share a commitment to verifiable and interpretable reasoning anchored in explicit perception.
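
As one concrete instantiation, a CoCoT-style three-stage prompt might be assembled as in the minimal sketch below. The stage ordering follows Park et al. (27 Jul 2025), but the prompt wording and the chat-message format are illustrative assumptions, not the published prompt.

```python
def build_cocot_prompt(question: str) -> list[dict]:
    """Assemble a three-stage Perception -> Situation -> Norm prompt
    for a chat-style vision-language model (illustrative wording only)."""
    stages = [
        ("Perception",
         "List only what is directly observable in the image: objects, "
         "attributes, and actions. Do not infer intent yet."),
        ("Situation",
         "Using only the facts listed above, describe the social situation "
         "and the relations between the participants."),
        ("Norm",
         "Given the situation, state the applicable social norm and answer "
         "the question, citing the perceptual facts that support it."),
    ]
    instructions = "\n\n".join(
        f"Step {i + 1} ({name}): {text}" for i, (name, text) in enumerate(stages)
    )
    return [
        {"role": "system",
         "content": "Ground every reasoning step in the attached image."},
        {"role": "user",
         "content": f"{instructions}\n\nQuestion: {question}"},
    ]
```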

2. Formal Structures, Pipeline Designs, and Mathematical Protocols

The structure of PG-CoT varies with modality and task, but common formal designs include:

  • Input Mapping: Images $I$, audio $x$, or 3D scenes $S$ are mapped via pre-trained encoders to feature representations ($f_v(I)$ for images, $g(x)$ for audio).
  • Perception Prompting: For images, prompts like "Based on the image, describe what is directly observable" invoke the model's visual modules or upstream captioners to enumerate objects, attributes, and actions (Park et al., 27 Jul 2025). In object referring, candidates $B = \{b_i\}$ are retrieved and each is justified through binary reasoning steps (Jiang et al., 4 Jun 2025).
  • Chain Tokenization: In spatial settings, each token $r_t^{(g)}$ in the chain is paired with a grounding descriptor $g_t$, such as a coordinate pair $(x_t, y_t)$ or an object mask, and is typically linearized into the text stream for supervised training (Du et al., 27 Nov 2025); a serialization sketch follows below.

No universally adopted formal notation exists for the visual component, though coordinate-based descriptors, bounding-box emissions, and tokenized image features are common.
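
To make the chain-tokenization design concrete, the following is a minimal sketch of serializing grounded steps into a plain-text stream. The `GroundedStep` structure and the `<step .../>` tag format are hypothetical choices for illustration, not the exact scheme of Du et al. (27 Nov 2025).

```python
from dataclasses import dataclass

@dataclass
class GroundedStep:
    action: str  # e.g., "move_right"
    x: int       # grounding descriptor: column coordinate
    y: int       # grounding descriptor: row coordinate

def linearize(steps: list[GroundedStep]) -> str:
    """Serialize each step with its coordinate descriptor into plain text,
    so the chain can be supervised as ordinary next-token prediction."""
    return " ".join(f"<step action={s.action} pos=({s.x},{s.y})/>" for s in steps)

chain = [GroundedStep("move_right", 1, 0), GroundedStep("move_down", 1, 1)]
print(linearize(chain))
# <step action=move_right pos=(1,0)/> <step action=move_down pos=(1,1)/>
```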

Exemplary PG-CoT Pipeline

A generic inference pipeline involves the following stages (a minimal code sketch follows the list):

  1. Perception Extraction: $P_1 \leftarrow \text{Perception}(I, Q)$
  2. Reasoned Inference: $S_2 \leftarrow \text{Reasoning}(P_1)$
  3. Normative or Task Synthesis: $N_3 \leftarrow \text{Synthesis}(S_2)$
  4. Answer Output: the final answer and rationale are produced from $N_3$
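
A minimal Python rendering of this pipeline, where `perceive`, `reason`, and `synthesize` are placeholders for model or module calls rather than a real API:

```python
from typing import Any, Callable

def pg_cot_pipeline(image: Any, question: str,
                    perceive: Callable, reason: Callable,
                    synthesize: Callable) -> str:
    """Run the generic three-stage PG-CoT inference pipeline."""
    p1 = perceive(image, question)  # P1: enumerate observable evidence
    s2 = reason(p1)                 # S2: higher-level inference grounded in P1
    n3 = synthesize(s2)             # N3: task- or norm-level synthesis
    return n3                       # final answer with grounded rationale

# Toy usage with stub functions standing in for model calls:
answer = pg_cot_pipeline(
    image=None, question="Who is speaking?",
    perceive=lambda img, q: "two people at a table; left person's mouth is open",
    reason=lambda p: f"Given '{p}', the left person is likely speaking.",
    synthesize=lambda s: s,
)
print(answer)
```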

Variants such as SceneCOT interleave module calls for task recognition, region localization, entity grounding, and explicit cue integration at special tokens in a joint trace (Linghu et al., 19 Oct 2025).
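
Purely as an illustration of this interleaving, a joint trace might be serialized as below; the special-token names (`<TASK>`, `<REGION>`, `<ENTITY>`, `<CLUE>`) are hypothetical stand-ins, not SceneCOT's actual token vocabulary.

```python
def build_joint_trace(task: str, region: str, entity: str,
                      clue: str, answer: str) -> str:
    """Interleave module outputs at special tokens in a single reasoning trace."""
    return (f"<TASK>{task}</TASK> "
            f"<REGION>{region}</REGION> "
            f"<ENTITY>{entity}</ENTITY> "
            f"<CLUE>{clue}</CLUE> "
            f"Answer: {answer}")

print(build_joint_trace(
    task="spatial query",
    region="kitchen table area",
    entity="mug_03 at (1.2, 0.4, 0.9)",
    clue="the mug is left of the laptop",
    answer="the mug",
))
```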

3. Key Applications and Modalities

PG-CoT protocols have been adapted to a range of multimodal tasks:

  • Social Reasoning: Disambiguating intent and interpreting conversational norms from images and ambiguous utterances (VAGUE, M³CoT datasets) (Park et al., 27 Jul 2025).
  • Geospatial Analytics: Object counting, scene classification, visual grounding, and captioning from remote sensing imagery, with strict region-level traceability (Geo-CoT380k) (Liu et al., 26 Sep 2025).
  • 3D Scene Understanding: Navigation, spatial queries, and attribute reasoning within egocentric 3D environments, made possible by object proposal modules and tokenized scene coordinates (SceneCOT-185K) (Linghu et al., 19 Oct 2025).
  • Object Referring and Visual Grounding: Reasoned comprehension of referring expressions, with explicit box-level reasoning and abstention mechanisms (Jiang et al., 4 Jun 2025).
  • Audio Reasoning: Sound event extraction and inference tasks, with chains anchored in segment-wise audio evidence (Ma et al., 13 Jan 2025).

A plausible implication is that PG-CoT paradigms generalize across modalities, provided perceptual signals can be structurally encoded and invoked during intermediate reasoning.

4. Experimental Protocols and Empirical Findings

PG-CoT frameworks consistently outperform baseline CoT and direct-prompt paradigms in accuracy, interpretability, and robustness, as evidenced in quantitative benchmarks:

| Model / Method | Task / Dataset | PG-CoT Variant | Metric / Gain | Reference |
|---|---|---|---|---|
| CoCoT (Perception-Situation-Norm) | VAGUE (GPT-4o, Gemini) | Full PG-CoT | +8–14.7% over CoT/Direct | (Park et al., 27 Jul 2025) |
| RSThinker (Geo-CoT) | VG, OC, Det, SC, VQA, Captioning | PG-CoT + GRPO | +20–70 points over base | (Liu et al., 26 Sep 2025) |
| SceneCOT | SceneCOT-185K / 3D Reasoning | Explicit grounding | High grounding-QA coherence, strong performance | (Linghu et al., 19 Oct 2025) |
| Rex-Thinker | HumanRef, RefCOCOg | PG-CoT w/ rejection | +13.8 pt rejection score, SOTA DF1 | (Jiang et al., 4 Jun 2025) |
| Mazes (Qwen2.5-VL-7B) | Visual maze reasoning, varied grid sizes | Short PG-CoT | Fastest convergence, best cross-size accuracy | (Du et al., 27 Nov 2025) |
| Cantor Scaffolding | ScienceQA, MathVista | Decision-expert PG-CoT | +4–10 pp over non-grounded baseline | (Gao et al., 2024) |

Ablation studies consistently show that short, minimally sufficient perceptual chains ("least grounding") generalize best, while longer chains accelerate convergence only up to a point. Perception-only ablations improve performance on certain tasks, but full multi-stage protocols best balance accuracy and conservative confidence (Park et al., 27 Jul 2025, Du et al., 27 Nov 2025).

5. Interpretability, Trustworthiness, and Design Considerations

PG-CoT enhances model verifiability and interpretable error analysis by tying every inference step to explicit sensory evidence. In Rex-Thinker, the rejection mechanism—abstaining when no candidate matches—reduces hallucinations by 13.8 points in rejection score (Jiang et al., 4 Jun 2025). SceneCOT's injection of numeric probabilities and visual clues biases the LLM towards grounded answers (Linghu et al., 19 Oct 2025). The use of box coordinates, visual tokens, or segmented attention supports user-side auditing and post hoc explanation.
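
A minimal sketch of such a rejection step, with an illustrative scoring callable and threshold (both assumptions, not Rex-Thinker's published implementation):

```python
from typing import Callable, Optional

def refer_with_rejection(candidates: list[tuple[tuple, str]],
                         score: Callable[[str], float],
                         threshold: float = 0.5) -> Optional[tuple]:
    """Score each candidate box against the referring expression's evidence
    and abstain (return None) when no candidate is sufficiently grounded."""
    if not candidates:
        return None
    best_score, best_box = max(
        ((score(evidence), box) for box, evidence in candidates),
        key=lambda pair: pair[0],
    )
    if best_score < threshold:
        return None  # abstain: no box matches the expression
    return best_box
```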

Common design challenges include:

  • Ensuring perceptual grounding quality (captioner reliability, vision module precision)
  • Calibrating chain length and prompt size against model context windows
  • Avoiding over-conservatism or brittle step-by-step failure propagation
  • Constructing datasets so that perceptual facts are necessary and sufficient for the answer

Open problems remain in extending rejection and abstention mechanisms, scaling to multi-object interactions, and achieving true cross-modal coherence in joint reasoning.

6. Generalization Dynamics and Practical Guidelines

A central finding is the "short is long" effect—minimalistic, perception-anchored chains yield optimal generalization across scales and domains. In maze-solving, PG-CoT with coordinate-only steps achieves fast convergence (200 RL steps to 90% accuracy) and highest accuracy on unseen grid sizes, compared to verbose or visually manipulated chains (Du et al., 27 Nov 2025).

Practical guidelines include (two are sketched in code after this list):

  • Emitting only essential grounded tokens (coordinates, object IDs)
  • Normalizing descriptors for scale invariance
  • Linearizing multimodal tokens into the generation stream
  • Mixing scales and error types at supervised fine-tuning (SFT), trusting locality to generalize at RL
  • Using lightweight format tags to preserve PG-CoT structure during RL (Du et al., 27 Nov 2025)
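
Two of these guidelines, descriptor normalization and lightweight format tags, might look as follows; the normalization scheme and tag names are illustrative assumptions rather than the papers' exact recipes.

```python
def normalize_coord(x: int, y: int, width: int, height: int) -> tuple[float, float]:
    """Map grid coordinates into [0, 1] so chains transfer across grid sizes."""
    return x / max(width - 1, 1), y / max(height - 1, 1)

def tag_chain(perception: str, reasoning: str, answer: str) -> str:
    """Wrap each stage in format tags that a rule-based RL checker can verify."""
    return (f"<perception>{perception}</perception>"
            f"<reasoning>{reasoning}</reasoning>"
            f"<answer>{answer}</answer>")

# e.g., normalize_coord(3, 1, width=4, height=4) -> (1.0, 0.333...)
```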

In sum, perception-grounded chain-of-thought reasoning enables multimodal models to produce transparent, interpretable, and robust inferences, particularly when models are explicitly scaffolded to emit and attend to concrete perceptual evidence throughout their chain of reasoning. The approach is increasingly viewed as a key ingredient for high-stakes domains requiring factual auditability, such as social reasoning, remote sensing, and complex referential tasks.
