MM-GCoT Dataset: Multimodal Visual Reasoning Benchmark
- MM-GCoT is a comprehensive multimodal dataset integrating stepwise chain-of-thought explanations with precise spatial grounding in both 2D and 3D contexts.
- The dataset uses a structured annotation process including region-object alignment, spatial-relationship graphs, and quality-controlled templates to ensure high accuracy.
- It significantly improves model performance by reducing visual hallucination and enhancing spatial reasoning, serving as a robust benchmark for multimodal AI research.
The Multimodal Grounded Chain-of-Thought (MM-GCoT) dataset is a large-scale, richly annotated resource for visual reasoning that integrates stepwise chain-of-thought (CoT) explanations with precise grounding annotations. MM-GCoT enables vision-language models to learn stepwise, textually structured reasoning while grounding each step in spatially localized visual evidence. This paradigm addresses visual hallucination in multimodal LLMs (MLLMs) by supervising not only final answers but also the intermediate process of visual-spatial reasoning and localization in both 2D and 3D scenarios (Wu et al., 17 Mar 2025, Chen et al., 15 Oct 2025).
1. Construction Methodology
MM-GCoT consists of two main branches: a 2D variant (images with dense region–object annotation) and a 3D extension (scenes with full geometric reconstruction), with both designed around the principle of “grounded chain-of-thought.” Each example requires the model to sequentially identify, localize, and reason over visual entities, with explicit bounding box annotations for every referenced step.
2D MM-GCoT Construction (Wu et al., 17 Mar 2025):
- Stage 1: Region–Object Alignment — Leveraging Visual Genome’s region descriptions and bounding boxes; linking by IoU threshold to ensure alignment between textual region and visual object.
- Stage 2: Spatial-Relationship Graphs — Constructing graphs where nodes are objects and edges encode spatial (e.g., ‘above’, ‘left of’) or semantic (e.g., ‘has attribute’) relations. Multi-hop paths through the graph are sampled as CoT skeletons.
- Stage 3: Template Instantiation — Paths expanded into structured templates (object IDs, attributes, relationships, box coordinates), then translated by a pretrained LLM (DeepSeek-v3) into natural questions, reasoning steps, and answers.
- Stage 4: Quality Control — Automated LLM-based verification for train/val; 100% manual expert review for the test set (994 samples).
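Stages 1 and 2 can be illustrated with a minimal sketch. It assumes boxes are given as (x1, y1, x2, y2) pixel corners, uses a hypothetical IoU threshold of 0.5 (the source does not state the value), and enumerates simple paths up to a hypothetical `max_hops` bound. The function names are illustrative and are not taken from the authors' pipeline.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)
Edge = Tuple[str, str, str]              # (subject, relation, object)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 2D boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def align_regions(regions: Dict[str, Box], objects: Dict[str, Box],
                  thr: float = 0.5) -> Dict[str, str]:
    """Link each region to its best-overlapping object if IoU >= thr.

    The 0.5 default is an assumption made for illustration only.
    """
    links: Dict[str, str] = {}
    for rid, rbox in regions.items():
        best_id, best_score = None, thr
        for oid, obox in objects.items():
            score = iou(rbox, obox)
            if score >= best_score:
                best_id, best_score = oid, score
        if best_id is not None:
            links[rid] = best_id
    return links


def multi_hop_paths(edges: List[Edge], start: str,
                    max_hops: int) -> List[List[Edge]]:
    """Enumerate simple paths of 1..max_hops edges starting at `start`."""
    adj: Dict[str, List[Tuple[str, str]]] = {}
    for subj, rel, obj in edges:
        adj.setdefault(subj, []).append((rel, obj))

    paths: List[List[Edge]] = []

    def dfs(node: str, visited: frozenset, path: List[Edge]) -> None:
        if path:
            paths.append(list(path))
        if len(path) == max_hops:
            return
        for rel, nxt in adj.get(node, []):
            if nxt not in visited:
                path.append((node, rel, nxt))
                dfs(nxt, visited | {nxt}, path)
                path.pop()

    dfs(start, frozenset({start}), [])
    return paths
```

Each returned path is a candidate CoT skeleton: a sequence of (object, relation, object) hops that a later template stage could expand into a question, reasoning steps, and an answer.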
3D MM-GCoT Construction (Chen et al., 15 Oct 2025):
- Scene Sources — ScanNet, ScanNet++, ARKitScenes, yielding 1,500 RGB-D indoor scans with diverse object categories.
- Annotation Pipeline — 3D oriented bounding boxes (OBBs) per object, PCA-aligned to room axes, then converted to axis-aligned bounding boxes.
- QA and CoT Generation — Template-based raw QA construction for eight spatial/temporal tasks, with five tasks involving explicit multi-step CoT (e.g., absolute/relative distances or directions, route planning). Stepwise reasoning paths are generated using GPT-4o prompted with scenes, category data, and bounding box overlays.
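One common way to turn an oriented box into an axis-aligned one is to take the coordinate-wise extent of its eight corners. The sketch below assumes an OBB is given as a center, a full size along its local axes, and a 3x3 rotation matrix whose columns are the box axes in world coordinates. This parameterization and the function name `obb_to_aabb` are assumptions for illustration; the source does not specify how the conversion is carried out.

```python
import numpy as np


def obb_to_aabb(center, size, rotation):
    """Return (min_xyz, max_xyz) of the tightest AABB enclosing an OBB.

    center:   (3,) box center in world coordinates
    size:     (3,) full extents along the box's local axes
    rotation: (3, 3) matrix mapping local box coordinates to world
              coordinates (assumed convention)
    """
    center = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2.0
    rotation = np.asarray(rotation, dtype=float)

    # All eight sign combinations for the corners of the local box.
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    # Rotate local corners into world coordinates, then translate.
    corners = (signs * half) @ rotation.T + center
    return corners.min(axis=0), corners.max(axis=0)
```

If the scene has already been rotated so that objects line up with the room axes, the enclosing box computed this way is close to the original OBB; for objects that remain tilted it is a looser bound.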
2. Dataset Structure, Annotation Schema, and Statistics
2D MM-GCoT Dataset (Wu et al., 17 Mar 2025):
- Images: 5,033
- Total GCoT Examples: 24,022 (23,028 train/val, 994 test)
- Question Types: Attribute (21.6%), Judgment (53.5%), Object (20.8%)
| Split | Attribute | Judgment | Object | Total |
|---|---|---|---|---|
| Train/Val | 5,183 | 12,849 | 4,996 | 23,028 |
| Test | 459 | 206 | 329 | 994 |
- Schema: Each example is a JSON object with the fields image_id, question, question_type, chain_of_thought (an array of reasoning steps, each with a reasoning string and box_gt coordinates), answer, and answer_box_gt.
- Annotation Quality: The test split is 100% manually curated for unambiguous questions and precise box coordinates.
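To make the schema concrete, here is a hypothetical record built from the field names listed above, together with a small structural check. The field values, the lowercase `question_type` label, the `image_id` value, and the four-number box layout are all assumptions for illustration; only the field names come from the dataset description.

```python
import json

REQUIRED_FIELDS = (
    "image_id", "question", "question_type",
    "chain_of_thought", "answer", "answer_box_gt",
)

# Hypothetical record; values are illustrative, not taken from the dataset.
example = {
    "image_id": 1,
    "question": "What color is the frisbee?",
    "question_type": "attribute",
    "chain_of_thought": [
        {"reasoning": "Locate the frisbee.", "box_gt": [35, 120, 210, 275]},
        {"reasoning": "The frisbee is red.", "box_gt": [35, 120, 210, 275]},
    ],
    "answer": "Red",
    "answer_box_gt": [35, 120, 210, 275],
}


def validate_record(rec: dict) -> bool:
    """Check that a record has the listed fields and 4-number boxes."""
    missing = [f for f in REQUIRED_FIELDS if f not in rec]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for i, step in enumerate(rec["chain_of_thought"]):
        if "reasoning" not in step or "box_gt" not in step:
            raise ValueError(f"step {i} lacks reasoning or box_gt")
        if len(step["box_gt"]) != 4:
            raise ValueError(f"step {i} box_gt must have 4 numbers")
    if len(rec["answer_box_gt"]) != 4:
        raise ValueError("answer_box_gt must have 4 numbers")
    return True


# Round-trip through JSON to mimic loading a record from disk.
loaded = json.loads(json.dumps(example))
validate_record(loaded)
```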
3D MM-GCoT Dataset (Chen et al., 15 Oct 2025):
- Total QA Pairs: 156,000 across 1,500 scenes
- CoT Coverage: 79% have explicitly annotated multi-step CoT
- Tasks: Object Counting, Room-Size Estimation, Absolute/Relative Distance, Object Size, Relative Direction, Appearance Order, Route Planning
3D Annotation Format: Each object annotated per scene as a JSON object with an obj_id, category, and 6D bbox. Each QA pair links objects, QA text, and (when applicable) a reasoning path specifying the sequence of grounding and spatial computations.
3. Task Diversity, Supported Evaluation, and Generalization
MM-GCoT is designed to train and benchmark grounded CoT reasoning in both 2D and 3D, and models trained on it also generalize to a wide range of multimodal reasoning tasks without task-specific prompt engineering.
- Supported Tasks:
  - Open-world Visual Question Answering (VQA)
  - Referring Expression Comprehension (REC)
  - Phrase grounding with multi-step explanations (2D)
  - Spatial/temporal reasoning (3D): distance, direction, route planning
- Generalization: MLLMs fine-tuned on MM-GCoT answer standard VQA questions by returning an answer, the corresponding box, and a stepwise reasoning trace (Wu et al., 17 Mar 2025, Chen et al., 15 Oct 2025).
4. Evaluation Metrics and Benchmarking Protocols
2D MM-GCoT Metrics (Wu et al., 17 Mar 2025):
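- Answer Accuracy (A-Acc): the fraction of examples whose predicted answer matches the ground truth.
- Grounding Accuracy (G-Acc, Acc@0.5): the fraction of examples whose predicted box has IoU ≥ 0.5 with the ground-truth box.
- Answer-Grounding Consistency:
$$\text{Consistency} = \frac{N_{AG}}{N_{AG} + N_{A\bar{G}} + N_{\bar{A}G}}$$
where $N_{AG}$ counts examples with a correct answer and a correct box, $N_{A\bar{G}}$ those with a correct answer but a wrong box, and $N_{\bar{A}G}$ those with a wrong answer but a correct box.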
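The three 2D metrics can be computed from per-example outcomes. The sketch below assumes consistency is the number of examples with both answer and box correct, divided by the number with at least one of the two correct. That is one reading of the three quantities listed above, not a formula confirmed by the source. The input format and the function name `evaluate_2d` are likewise illustrative.

```python
from typing import Dict, List, Tuple


def evaluate_2d(results: List[Tuple[bool, float]],
                iou_thr: float = 0.5) -> Dict[str, float]:
    """Compute answer accuracy, grounding accuracy, and consistency.

    results: one (answer_is_correct, box_iou) pair per example, where
             box_iou is the IoU between predicted and ground-truth box.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to evaluate")

    both = sum(1 for ok, iou in results if ok and iou >= iou_thr)
    answer_only = sum(1 for ok, iou in results if ok and iou < iou_thr)
    box_only = sum(1 for ok, iou in results if not ok and iou >= iou_thr)

    # Assumed definition: fully correct over at-least-partly correct.
    partly = both + answer_only + box_only
    return {
        "a_acc": (both + answer_only) / n,
        "g_acc": (both + box_only) / n,
        "consistency": both / partly if partly else 0.0,
    }


scores = evaluate_2d([(True, 0.8), (True, 0.2), (False, 0.6), (False, 0.1)])
```

In this toy input, half the answers and half the boxes are correct, but only one of the three at-least-partly-correct examples gets both right, so consistency is lower than either accuracy.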
3D MM-GCoT Metrics (Chen et al., 15 Oct 2025):
- 3D Visual Grounding: Acc@25, Acc@50 IoU; F1@25, F1@50 (multi-object)
- Spatial Reasoning: Task-specific numerical tolerance (ObjectCount, AbsDist, RoomSize), classification accuracy for multiple-choice (RelDist, RelDir, RoutePlan, ApprOrder)
- Benchmarks: Scene-level train/val/test splits prevent scene-level leakage. Reported results include GS-Reasoner baselines, ablations with and without grounded CoT, and comparisons on VSI-Bench.
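For the 3D grounding metrics, Acc@25 and Acc@50 are conventionally the share of predictions whose 3D IoU with the ground-truth box reaches 0.25 or 0.5. The sketch below assumes axis-aligned boxes stored as (x_min, y_min, z_min, x_max, y_max, z_max) and single-object grounding; the multi-object F1 variants are not covered, and the helper names are illustrative.

```python
from typing import List, Sequence, Tuple

Box3D = Sequence[float]  # assumed: (x_min, y_min, z_min, x_max, y_max, z_max)


def aabb_iou_3d(a: Box3D, b: Box3D) -> float:
    """IoU of two axis-aligned 3D boxes."""
    inter = 1.0
    for k in range(3):
        lo = max(a[k], b[k])
        hi = min(a[k + 3], b[k + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def acc_at(pairs: List[Tuple[Box3D, Box3D]], thr: float) -> float:
    """Share of (predicted, ground-truth) pairs with IoU >= thr."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, gt in pairs if aabb_iou_3d(pred, gt) >= thr)
    return hits / len(pairs)
```

A prediction shifted by half the box width along one axis has IoU 1/3, so it counts toward Acc@25 but not Acc@50.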
5. Illustrative Annotation Examples
2D Example (Attribute):
- Image context: Red frisbee on grass
- Question: “What color is the frisbee?”
- CoT Steps: (1) Box the object, (2) Observe hue
- Answer: “Red”
- Answer Box: (35,120,210,275)
2D Example (Object):
- Image context: Desk with a mug, keyboard, and book
- Question: “Which object is placed to the left of the keyboard?”
- CoT Steps: (1) Box the keyboard, (2) Find the object to its left (mug)
- Answer: “Mug”
- Answer Box: (80,310,140,390)
3D Example (schematic):
- Spatial question: “Which two objects are the closest?”
- CoT Steps: (1) Ground objects via 3D box; (2) Compute centroids; (3) Compare distances; (4) Conclusion on pair proximity.
This multi-step format interleaves textual reasoning and explicit bounding box grounding for every referenced entity throughout the explanation path (Wu et al., 17 Mar 2025, Chen et al., 15 Oct 2025).
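The schematic 3D example can be traced in code. The sketch below follows the four listed steps for the "closest pair" question, assuming each grounded object is an axis-aligned box given as (x_min, y_min, z_min, x_max, y_max, z_max) and that proximity is measured between box centroids. The object names and coordinates are invented for illustration.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def closest_pair(
    boxes: Dict[str, Sequence[float]]
) -> Tuple[Tuple[str, str], float]:
    """Return the pair of objects with the smallest centroid distance."""
    # Step 1 (grounding) is assumed done: `boxes` holds one 3D box per object.
    # Step 2: compute the centroid of each box.
    centroids = {
        name: tuple((b[i] + b[i + 3]) / 2.0 for i in range(3))
        for name, b in boxes.items()
    }
    # Step 3: compare distances over all unordered pairs.
    best = min(
        combinations(centroids, 2),
        key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
    )
    # Step 4: report the closest pair and its distance.
    return best, math.dist(centroids[best[0]], centroids[best[1]])


# Hypothetical scene with three grounded objects.
scene = {
    "chair": (0, 0, 0, 1, 1, 1),
    "table": (1, 0, 0, 2, 1, 1),
    "lamp": (5, 5, 0, 6, 6, 2),
}
pair, distance = closest_pair(scene)
```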
6. Data Representation and Modeling Implications
2D Representation: Box annotations refer to image-space rectangles; all localized steps are in pixel coordinates.
3D Representation and Dual-Path Pooling (Chen et al., 15 Oct 2025):
- Semantic features per patch (e.g., SigLIP-ViT)
- Geometric features: depth back-projected to points; encoded by Point Transformer v3/Sonata
- Dual-path fusion: cross-attention for semantic alignment, interpolation for positional features, 3D positional encoding
- Patch descriptors are passed as input to the LLM
This unified representation allows simultaneous learning of fine-grained grounding and complex spatial reasoning by conditioning the LLM on rich scene context.
7. Research Impact and Benchmarking Insights
MM-GCoT establishes a new paradigm for teaching and evaluating vision-language models on stepwise, visually grounded reasoning. Notable impacts:
- Reduction in Visual Hallucination: Training with MM-GCoT markedly improves answer-grounding consistency; for example, with SFT, LLaVA gains more than 4% in answer accuracy, and its consistency rises from ≃10% to ≃56% (Wu et al., 17 Mar 2025).
- Task Generalization: Learned GCoT strategies transfer to open-world VQA and referring tasks with minimal extra engineering.
- Closing Dataset Gaps: MM-GCoT bridges the gap between standard CoT datasets (which lack spatial supervision) and grounding datasets (which lack multi-step reasoning), directly supporting research on trustworthy, interpretable multimodal reasoning.
- 3D Extension: In 3D reasoning, explicit grounded CoT supervision improves spatial reasoning by 8.4% absolute compared to ablated models and achieves state-of-the-art 3D visual grounding performance, as shown in GS-Reasoner test results (Acc@50=42.2%, VSI-Bench aggregate=64.7–70.1%) (Chen et al., 15 Oct 2025).
These findings demonstrate that MM-GCoT provides a robust, multi-purpose benchmark for the systematic study of visual reasoning and grounding in both 2D and 3D multimodal AI.