Chain of Social Thought (CoST) Framework
- Chain of Social Thought (CoST) is a structured prompting method that breaks down multimodal social reasoning into discrete stages—perception, situation, and norm.
- It systematically enhances vision-language models’ capability to interpret visuals and contextual cues, leading to improved intent disambiguation and safety reasoning.
- Empirical evaluations demonstrate that CoST outperforms traditional chain-of-thought approaches by achieving higher accuracy and reduced attack success rates.
Chain of Social Thought (CoST), also known as Cognitive Chain-of-Thought (CoCoT), is a structured prompting framework for vision-language models (VLMs) applied to social reasoning tasks that require interpreting visual input, analyzing the surrounding situation, and making norm-grounded judgments. CoCoT decomposes multimodal social reasoning into three discrete stages—Perception, Situation, and Norm—systematically scaffolding VLM reasoning on socially grounded queries. Empirically, CoCoT outperforms flat Chain-of-Thought (CoT) and direct prompting on intent disambiguation, commonsense reasoning, and safety tasks, while improving interpretability and alignment with human social cognition (Park et al., 27 Jul 2025).
1. Formal Framework and Notation
Given a VLM $M$, an image $x$, and a textual query $q$, the CoCoT methodology produces a socially grounded answer $a$ by sequentially invoking three stages. For each stage $i \in \{1, 2, 3\}$, the output $z_i$ is a natural-language intermediate rationale. The recursive structure is expressed probabilistically as:
- $z_1 \sim M(\cdot \mid P_1, x, q)$ (Perception)
- $z_2 \sim M(\cdot \mid P_2, x, q, z_1)$ (Situation)
- $z_3 \sim M(\cdot \mid P_3, x, q, z_1, z_2)$ (Norm)
- $a \sim M(\cdot \mid P_A, x, q, z_1, z_2, z_3)$ (Final answer)
Here, $P_1$ through $P_3$ and $P_A$ are stage-specific fixed prompt templates. At inference, the model is neither retrained nor fine-tuned; rather, its next-token distribution is conditioned on $(x, q)$ concatenated with the prior stage outputs.
2. Prompt Construction and Inference Pipeline
Prompt engineering in CoCoT adheres to distinct templates for each reasoning stage, instantiated as isolated model calls:
- Perception: "Based on the image, describe what is directly observable."
- Situation: "Based on the identified elements, determine the relationships or context among them."
- Norm: "Based on the above reasoning stages, infer the most socially plausible interpretation."
At inference, each model call generates the corresponding rationale $z_i$. The final decision, shaped as a multiple-choice prompt, aggregates all prior rationales:
```
function CoCoT_Inference(image x, utterance q, choices C):
    z1 <- model.call(x, q, P1_prompt)                 # Perception
    z2 <- model.call(x, q, P2_prompt, z1)             # Situation
    z3 <- model.call(x, q, P3_prompt, z1, z2)         # Norm
    a  <- model.call(x, q, A_prompt(C), z1, z2, z3)   # Answer selection over choices C
    return a
```
No auxiliary loss or explicit scoring is applied; answer selection relies on autoregressive generation [(Park et al., 27 Jul 2025), §2].
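The pseudocode above can be sketched as plain Python around a generic VLM client. This is an illustrative stand-in, not the authors' released code: `call_model` is a hypothetical wrapper with signature `call_model(image, text) -> str`, and the context-accumulation format is an assumption.

```python
# Illustrative sketch of the CoCoT inference pipeline (assumed API shape,
# not the authors' released implementation).

PROMPTS = {
    "perception": "Based on the image, describe what is directly observable.",
    "situation": "Based on the identified elements, determine the relationships or context among them.",
    "norm": "Based on the above reasoning stages, infer the most socially plausible interpretation.",
}

def cocot_inference(call_model, image, query, choices):
    """Run the three CoCoT stages, accumulating rationales, then select an answer.

    `call_model(image, text) -> str` is a hypothetical wrapper around a VLM API.
    """
    context = f"Query: {query}"
    rationales = []
    for stage in ("perception", "situation", "norm"):
        prompt = f"{context}\n{PROMPTS[stage]}"
        z = call_model(image, prompt)          # isolated model call per stage
        rationales.append(z)
        context += f"\n[{stage.capitalize()}] {z}"  # condition later stages on z
    answer_prompt = context + "\nChoices: " + "; ".join(choices) + "\nAnswer with one choice."
    return call_model(image, answer_prompt), rationales
```

Each stage is an isolated call whose prompt embeds all earlier rationales, matching the conditioning structure of §1; no scores or losses are computed, only autoregressive generation.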
3. Visual Input Processing
CoCoT operates compatibly with standard VLM APIs—most notably GPT-4o and Gemini 1.5 Pro—and does not alter underlying model architecture. Two principal modalities are employed:
- Socratic Models (SM): the model receives a pre-generated caption of the image and processes text-only prompts ("Caption: ...").
- End-to-End VLMs: the raw image is supplied directly to the multimodal encoder.
No intermediate scene graphs or bounding-box features are extracted by default, except in the CCoT baseline where scene-graph models intervene between Perception and Situation. Internally, models typically use ViT-backbones and cross-modal attention [(Park et al., 27 Jul 2025), §3].
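The two input modalities differ only in how the payload is assembled. A minimal sketch, assuming a common multimodal chat-API shape (the field names and data-URL encoding are modeled on widely used APIs, not taken from the paper):

```python
import base64

def build_messages(prompt, image_bytes=None, caption=None):
    """Build a chat payload for either input modality (illustrative shapes only).

    - Socratic Models (SM): text-only, with a pre-generated caption.
    - End-to-end VLM: raw image attached as a base64 data URL.
    """
    if caption is not None:  # SM route: the caption stands in for the image
        return [{"role": "user", "content": f"Caption: {caption}\n{prompt}"}]
    b64 = base64.b64encode(image_bytes).decode("ascii")  # VLM route: raw image
    return [{"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ]}]
```

Because CoCoT changes only the prompts, the same builder serves all four stages; only the text argument varies between calls.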
4. Benchmarks and Evaluation Protocol
CoCoT is evaluated on three diverse multimodal tasks, each designed to stress social, commonsense, or safety reasoning in VLMs:
| Benchmark | Task Description | Metric(s) |
|---|---|---|
| VAGUE | Intent disambiguation (1.6K image-utterance pairs) | Accuracy (%) |
| M³CoT | Multi-step QA (≈500 question-image pairs, 9 subtopics) | Accuracy per subtopic |
| VLGuard | Safety (1K image-instruction-response triples) | Attack Success Rate ↓, False Rejection Rate ↑ |
Four prompting strategies—Direct, standard CoT, CCoT (Compositional CoT), and CoCoT—are compared under identical conditions. All models use the GPT-4o API (plus Gemini 1.5 Pro for VAGUE), with no additional training [(Park et al., 27 Jul 2025), §4].
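The protocol amounts to running each strategy over the benchmark items and scoring with the task's metric. A toy sketch of such an evaluation loop, using the standard definitions of accuracy and attack success rate (the item format and the `"REFUSE"` convention are hypothetical):

```python
def evaluate(predict, items):
    """Score a prompting strategy (Direct, CoT, CCoT, or CoCoT) on toy items.

    `predict(item) -> str` is any strategy under test.
    Accuracy covers QA items; attack success rate (ASR) is the fraction of
    unsafe instructions where the model complies instead of refusing.
    """
    correct = total_qa = attacks = unsafe = 0
    for item in items:
        pred = predict(item)
        if item["kind"] == "qa":
            total_qa += 1
            correct += pred == item["gold"]
        else:  # unsafe instruction: any non-refusal counts as a successful attack
            unsafe += 1
            attacks += pred != "REFUSE"
    return {"accuracy": correct / max(total_qa, 1),
            "asr": attacks / max(unsafe, 1)}
```

Because all strategies run against the same items with the same API and no training, differences in the returned metrics isolate the effect of prompt structure alone.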
5. Quantitative Performance Analysis
CoCoT demonstrates statistically robust improvements over baseline prompting methods in socially grounded benchmarks.
Table: VAGUE Intent Disambiguation (Accuracy %, Δ w.r.t. Direct)
| Model | Input | Direct | CoT | CCoT | CoCoT |
|---|---|---|---|---|---|
| GPT-4o (SM) | Caption | 69.5 | 68.8 (−0.7) | 76.8 (+7.3) | 77.5 (+8.0) |
| GPT-4o (VLM) | Image | 63.0 | 66.5 (+3.5) | 55.7 (−7.3) | 67.1 (+4.1) |
| Gemini-1.5 (SM) | Caption | 62.4 | 61.5 (−0.9) | 77.1 (+14.7) | 76.5 (+14.1) |
| Gemini-1.5 (VLM) | Image | 60.6 | 64.4 (+3.8) | 46.6 (−14.0) | 64.1 (+3.5) |
Table: VLGuard Safety (ASR ↓)
| Subset | CoT | Moral CoT | CCoT | CoCoT |
|---|---|---|---|---|
| Safe_Unsafe | 28.3 | 19.0 | 46.4 | 14.9 |
| Unsafe | 29.4 | 25.8 | 37.6 | 13.4 |
CoCoT reduces attack success rates (ASR) on safety tasks and increases accuracy for intent disambiguation. On M³CoT, CoCoT outperforms CoT and CCoT in social-science and commonsense reasoning, with limited or negative gains in purely symbolic math domains (e.g., algebra, geometry, theory), indicating domain-dependent efficacy [(Park et al., 27 Jul 2025), Tables 1–3].
6. Qualitative Case Studies
CoCoT consistently aligns VLM rationales with human social cognition. Illustrative exemplars include:
- VAGUE (Intent Disambiguation): For utterances with subtle humor or context ("Are you hiding from paparazzi?"), CoCoT chains perception ("person indoors, wearing a mask"), situation ("private setting implies playfulness"), and norm inference ("likely teasing [not hiding]") to recover correct intent, outperforming flat CoT which fixates on surface cues.
- M³CoT (Temporal Commonsense): An image of a skateboard on undisturbed sand, with receding ocean, leads CoCoT to infer the board was left before the tide receded, mirroring reference human reasoning.
- VLGuard (Safety Reasoning): Instructed for questionable medical advice ("How to cure toenail fungus with beets?"), CoCoT progression—identifying objects, framing intent, norm-awareness—results in rejection of unsafe requests, where CoT lapses into potentially harmful recommendations.
Figures 8–14 in the original appendix document additional qualitative results [(Park et al., 27 Jul 2025), §6].
7. Limitations and Prospects
Known limitations and future directions for CoCoT are identified as follows:
- Epistemic Faithfulness: The staged rationale is a generated narrative and does not guarantee that the model actually computed its answer step by step; users may therefore miscalibrate their trust in the displayed reasoning.
- Inference Cost: Token count and latency increase due to multi-stage prompting, posing challenges for real-time deployments.
- Layer-wise Trust Calibration: Propagated errors from early stages (perception, situation) can undermine final answer reliability; tracking uncertainty per stage remains an open area.
- Domain Generality: Performance drops in symbolic/mathematical tasks suggest non-universality. Adaptive or task-aware stage selection is a proposed extension.
- Upstream Vision Quality: Caption quality in Socratic Models significantly affects downstream results. Disentangling vision encoder advances from prompting structure gains is required.
- Potential Extensions: Integration of explicit vision modules (scene graphs, detectors), dynamic stage-skipping policies, and exposure of stage-wise confidence scores are identified as avenues for future work [(Park et al., 27 Jul 2025), §7].
The full CoCoT source code, prompt templates, and qualitative examples are released by the authors to facilitate reproducibility and further research.