
VLAA-Thinking Dataset: Visual-Language Reasoning

Updated 27 November 2025
  • VLAA-Thinking is a large-scale dataset for complex visual-language reasoning, offering structured chain-of-thought traces in both SFT and RL splits.
  • It employs a six-step construction pipeline to distill high-quality, verified reasoning traces from nine diverse vision-language benchmarks.
  • The dataset facilitates empirical studies contrasting imitative supervised instruction with adaptive, reward-driven self-reflective reasoning in LVLM training.

The VLAA-Thinking dataset is a large-scale resource for visual-language reasoning, systematically constructed to address the challenges of training and evaluating large vision-language models (LVLMs) on complex, stepwise multimodal reasoning. Distinctively, VLAA-Thinking organizes its data into two curated splits: one for supervised fine-tuning (SFT), emphasizing clean chain-of-thought (CoT) visual reasoning without explicit self-reflective cues, and one for reinforcement learning (RL), containing more difficult traces marked by “aha moments” where self-correction or explicit cognitive leaps are required. The construction pipeline ensures high-quality, verified reasoning traces, enabling robust comparative studies of SFT and RL paradigms in LVLM training, particularly in the transition toward genuine, adaptive visual reasoning behavior over rigid imitation (Chen et al., 10 Apr 2025).

1. Motivation and Design Objectives

VLAA-Thinking is motivated by the need to bridge the qualitative gap between high-performing text-only reasoners (such as DeepSeek-R1) and LVLMs, especially in emergent behaviors involving spatial, numerical, and mathematical inference. Existing LVLMs often mirror the reasoning traces of teacher or expert models, leading to "pseudo reasoning paths" that resemble true cognitive chains yet lack adaptive, authentic insight. The dataset is thus engineered to:

  • Provide high-quality, step-by-step visual reasoning traces that facilitate both supervised and RL-based instruction.
  • Promote the emergence of authentic self-reflective cues (“aha-moments”) rather than superficial or purely imitative reasoning.
  • Delineate two experimental regimes: a SFT split for clean, instructional CoT and an RL split demanding reward-driven cognitive exploration (Chen et al., 10 Apr 2025).

2. Source Corpora, Input Diversity, and Trace Statistics

The data originate from nine diverse vision-language benchmarks encompassing:

  • Synthetic visual reasoning (CLEVR-Math, Math PUMA)
  • Geometric proofs and scientific diagrams (GeoQA170K, ArxivQA)
  • Document understanding (DocVQA), real-world photographs (VizWiz), open-ended chat (ALLaVA-LAION)
  • Common-scene captioning (COCO and VisualGenome)

Images span from 224×224 crops (COCO) to high-resolution documents. Captions are uniformly produced via GPT-4V. The question set includes closed-ended types (counting, bounding-box, multiple choice, math expressions) and open-ended reasoning queries.

Key global statistics:

  • $N_{\rm meta} = 203{,}182$ initial (image, question) pairs.
  • $N_{\rm pipeline} = 144{,}895$ verified reasoning traces after pipeline filtering.
  • Final splits: $N_{\rm SFT} = 126{,}413$, $N_{\rm RL} = 25{,}195$.
  • Average reasoning chain length: $\approx 6.2$ steps (std $\approx 1.4$).
  • Mean tokens per trace: $\approx 120$ (with explicit step tagging) (Chen et al., 10 Apr 2025).

3. Six-Step Construction Pipeline

VLAA-Thinking employs a structured six-phase process to ensure data fidelity and diagnostic signal:

  1. Metadata Collection: Aggregation of 203K Q&A pairs across nine benchmarks.
  2. Image Captioning: GPT-4V generates structured, information-rich captions with injected dataset-specific details.
  3. Reasoning Distillation: Compositional inference via DeepSeek-R1, producing stepwise reasoning with a formalized “> …” chain.
  4. Answer Rewriting: GPT-3.5-turbo standardizes references (e.g., “caption” to “image”) and enforces phrase consistency; outputs with a sentence-length delta greater than 15 are excluded.
  5. Automated Verification: GPT-3.5 validates rewritten answers for semantic and factual conformity against ground-truth.
  6. Curating Splits: Traces undergo keyword scans for self-reflective cues (e.g., “double-check,” “wait,” “mistake”). The SFT split excludes all such “aha” traces; the RL split includes only traces with these markers, as sketched below (Chen et al., 10 Apr 2025).
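
The filtering and routing logic of steps 3, 4, and 6 can be summarized in a minimal Python sketch. It assumes a crude regex-based sentence count and uses only the three cue keywords quoted above; the actual pipeline presumably uses a richer keyword list and more careful segmentation, so this is an illustration rather than the released implementation.

```python
import re

# Self-reflective cue keywords quoted in the text; the full list used for
# VLAA-Thinking may be longer (assumption).
AHA_KEYWORDS = ("double-check", "wait", "mistake")


def parse_reasoning_chain(distilled_text: str) -> list[str]:
    """Step 3: extract stepwise reasoning from the '> ...' formatted chain."""
    return [line.lstrip("> ").strip()
            for line in distilled_text.splitlines()
            if line.startswith(">")]


def passes_length_filter(original: str, rewritten: str, max_delta: int = 15) -> bool:
    """Step 4: drop rewrites whose sentence count drifts by more than 15.
    Sentence counting here is a crude regex heuristic (assumption)."""
    n_sentences = lambda text: len(re.findall(r"[.!?]+", text))
    return abs(n_sentences(original) - n_sentences(rewritten)) <= max_delta


def route_trace(reasoning_steps: list[str]) -> str:
    """Step 6: the SFT split excludes 'aha' traces; the RL split keeps only
    traces containing at least one self-reflective cue."""
    text = " ".join(reasoning_steps).lower()
    return "RL" if any(kw in text for kw in AHA_KEYWORDS) else "SFT"


# Example: a trace with a reflective cue is routed to the RL split.
steps = parse_reasoning_chain("> Count the chairs.\n> Wait, let me double-check the corner.")
print(route_trace(steps))  # -> "RL"
```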

4. Splits, Schema, and Format

Final data are separated by reflectiveness and reasoning complexity:

  • VLAA-Thinking-SFT: $126,413$ traces, each free of “aha-moments.”
  • VLAA-Thinking-RL: $25,195$ traces, each containing at least one self-reflective cue.
  • Proportional composition: $p_{\rm SFT} \approx 0.83$, $p_{\rm RL} \approx 0.17$ (computed below).
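
With the split sizes above, these proportions work out as

$$p_{\rm SFT} = \frac{N_{\rm SFT}}{N_{\rm SFT} + N_{\rm RL}} = \frac{126{,}413}{126{,}413 + 25{,}195} = \frac{126{,}413}{151{,}608} \approx 0.834, \qquad p_{\rm RL} = 1 - p_{\rm SFT} \approx 0.166.$$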

The annotation schema is implemented as JSON records:

| Field | Description | Example Value |
|---|---|---|
| image_id | Image reference (link or hash) | "COCO_val2017_000000031769.jpg" |
| caption | GPT-4V-generated structured description | "An indoor living room ..." |
| question | Task posed to the model | (counting question; see example below) |
| reasoning | Ordered step list, e.g., ["Step 1: ..."] | List of 4 steps (see below) |
| final_answer | Output (free-form or exact answer) | "3" |

In the RL split, example reasoning steps include self-reflective (“aha”) content, e.g., "Step 3 (aha-moment): Wait, let me zoom in to confirm I didn’t miss any." (Chen et al., 10 Apr 2025)
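
Put together, a VLAA-Thinking-RL record following this schema would look roughly like the Python dict below. The image_id, caption fragment, aha-moment step, and final answer come from the examples above; the question text and remaining steps are hypothetical placeholders, not verbatim dataset content.

```python
# Illustrative record mirroring the JSON schema above; the question text and
# steps 1, 2, and 4 are hypothetical placeholders.
record = {
    "image_id": "COCO_val2017_000000031769.jpg",
    "caption": "An indoor living room ...",            # GPT-4V structured caption (truncated)
    "question": "How many chairs are in the room?",    # hypothetical counting question
    "reasoning": [
        "Step 1: The caption and image show a living room with several pieces of furniture.",
        "Step 2: I can identify two chairs by the table and one near the window.",
        "Step 3 (aha-moment): Wait, let me zoom in to confirm I didn't miss any.",
        "Step 4: No further chairs are visible, so the total is three.",
    ],
    "final_answer": "3",
}
```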

5. Licensing, Access, and Evaluation

VLAA-Thinking is openly accessible under a CC BY-4.0 license as part of the UCSC-VLAA initiative (Hugging Face: https://huggingface.co/datasets/UCSC-VLAA/VLAA-Thinking). Citation: Chen et al., “SFT or RL? An Early Investigation into Training R1-Like Reasoning LVLMs,” CoLM 2025.
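
Assuming the standard Hugging Face `datasets` workflow, the corpus can be pulled with the repository id from the URL above; the exact configuration and split names are not specified here, so consult the dataset card.

```python
from datasets import load_dataset  # pip install datasets

# Repository id taken from the Hugging Face URL above. If the repo defines
# multiple configurations, load_dataset raises an error listing them; pass the
# desired config name in that case (see the dataset card).
ds = load_dataset("UCSC-VLAA/VLAA-Thinking")
print(ds)  # shows available splits and their sizes

first_split = next(iter(ds.values()))
print(first_split[0])  # expect fields such as image_id, caption, question, reasoning, final_answer
```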

Evaluation utilizes the VLMEvalKit framework, targeting math reasoning tasks (MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista) with overall accuracy (%) as the principal metric. Common baseline models include Qwen2-VL-2B/7B, Qwen2.5-VL-3B/7B, and VLM-R1-Math (Chen et al., 10 Apr 2025).
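
As a rough illustration of the principal metric, overall accuracy reduces to an exact-match rate over final answers after light normalization; VLMEvalKit additionally applies its own benchmark-specific answer extraction and matching rules, so the sketch below is a simplification.

```python
def normalize(answer: str) -> str:
    """Crude normalization before exact-match comparison (simplification)."""
    return answer.strip().lower().rstrip(".")


def overall_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Exact-match accuracy in percent over a benchmark's questions."""
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)


# Toy example: two of three answers match exactly.
print(overall_accuracy(["3", "B", "12.5"], ["3", "C", "12.5"]))  # -> 66.66...
```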

6. Applications, Impact, and Limitations

VLAA-Thinking supports:

  • Supervised fine-tuning of LVLMs on clean chain-of-thought visual reasoning traces (SFT split).
  • Reinforcement learning with reward-driven exploration over self-reflective (“aha”) traces (RL split).
  • Controlled comparisons of SFT and RL training regimes for R1-like reasoning LVLMs, evaluated on the math-reasoning benchmarks above.

Key limitations and caveats:

  • Captioning errors propagate from upstream models; e.g., hallucinations in GeoQA170K restrict it to RL use only.
  • Reasoning traces are distilled from text-only R1 models and may not exhaustively represent visual phenomena.
  • The creation pipeline is resource-intensive, relying on multiple large LM passes per item.
  • The split is imbalanced, skewed toward SFT (0.83 vs. 0.17), which may introduce training bias (Chen et al., 10 Apr 2025).

A plausible implication is that the dataset’s construction methodology, particularly the explicit separation of imitative versus reflective traces, enables fine-grained research on the cognitive properties induced by SFT and RL in LVLMs.

7. Relation to Broader Multimodal Reasoning Datasets

VLAA-Thinking distinguishes itself by explicitly operationalizing the presence or absence of self-reflective (aha) moments in its split criteria and by leveraging reasoning chains distilled from high-performing text-only models onto multimodal tasks. This supports investigation into the transition from rigid, imitative reasoning acquired via SFT to more flexible, exploratory reasoning facilitated by RL—a key open problem in aligning LVLMs' cognitive processes with those of expert reasoners (Chen et al., 10 Apr 2025).

References

  • Chen et al. (10 Apr 2025). “SFT or RL? An Early Investigation into Training R1-Like Reasoning LVLMs.” CoLM 2025.