OneThinker-600k Multimodal Dataset

Updated 5 December 2025
  • OneThinker-600k is a large-scale multimodal dataset comprising 600,000 annotated samples from both images and videos for visual reasoning tasks.
  • It covers eight core tasks, spanning rule-based and open-ended QA, captioning, spatial/temporal/spatio-temporal grounding, tracking, and segmentation, to support comprehensive training and evaluation.
  • The dataset employs a systematic chain-of-thought annotation pipeline with rigorous quality filtering to boost the performance of unified multimodal models.

OneThinker-600k, constructed for the unified multimodal visual reasoning model OneThinker, is a large-scale dataset comprising 600,000 annotated samples designed to enable joint training across image and video tasks. It was developed to support generalist multimodal LLMs (MLLMs) for eight fundamental types of visual reasoning, with a systematic chain-of-thought (CoT) annotation pipeline. The corpus underpins the supervised fine-tuning of the OneThinker reasoning model and advances research toward unified, task- and modality-agnostic multimodal understanding (Feng et al., 2 Dec 2025).

1. Corpus Composition and Task Coverage

The OneThinker-600k corpus contains 600,000 multimodal samples, split approximately evenly between static images and short video clips.

Task Coverage:

  1. Rule-based Question Answering (QA): Multiple-choice, numerical, regression, math, OCR.
  2. Open-ended QA: General and scientific queries.
  3. Captioning: Free-form descriptions for both images and videos.
  4. Spatial Grounding: Identifying box coordinates or point locations in images.
  5. Temporal Grounding: Locating relevant time intervals in videos.
  6. Spatio-temporal Grounding: Simultaneous spatial and temporal localization.
  7. Tracking: Identification of object trajectory across frames.
  8. Segmentation: Region-based annotation for both image and video modalities.

The dataset's construction does not specify per-task instance counts in tabular form; only the aggregate total of 600,000 examples is reported. Given eight tasks, this implies an average of roughly 75,000 samples per task, so a balance on the order of 50,000–100,000 samples per fundamental task is plausible. A hypothetical mapping from each task to the shape of its expected answer is sketched below.
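
The eight tasks differ mainly in the structure of the answer they require. The following is a minimal, hypothetical sketch of that mapping; the task identifiers and answer descriptions are illustrative assumptions, not the dataset's official field values.

    # Hypothetical mapping from OneThinker-600k task types to the rough shape
    # of their expected answers; names below are illustrative assumptions.
    TASK_ANSWER_FORMAT = {
        "rule_based_qa":             "option letter, number, or short string",
        "open_ended_qa":             "free-form text",
        "captioning":                "free-form description (image or video)",
        "spatial_grounding":         "bounding box [x1, y1, x2, y2] or point [x, y]",
        "temporal_grounding":        "time span [start_s, end_s]",
        "spatio_temporal_grounding": "time span plus per-frame bounding boxes",
        "tracking":                  "sequence of bounding boxes across frames",
        "segmentation":              "region mask, e.g. prompted via points/boxes with SAM2",
    }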

2. Data Sources and Provenance

Samples in OneThinker-600k are selected from a diverse set of benchmarks, each addressing distinct facets of visual reasoning:

| Task | Data Sources (Original Benchmarks) |
|------|------------------------------------|
| Rule-based QA | MMMU, MathVista, MathVerse, MMBench, MMStar, ScienceQA, AI2D, MMT-Bench |
| Open-ended QA | ScienceQA and similar open-ended QA sets |
| Image Captioning | MMSci-Caption, MMT-Caption |
| Video Captioning | VideoMMLU-Caption |
| Spatial Grounding | RefCOCO, RefCOCO+, RefCOCOg |
| Temporal Grounding | Charades (TALL), ActivityNet (dense-caption events), ANet-RTL |
| Spatio-temporal Grounding | STVG |
| Tracking | GOT-10k |
| Segmentation | RefCOCO series, MeViS, ReasonVOS; region masks via SAM2 |

Integration across these sources enables the dataset to support multitask, multimodal reasoning research with broad empirical coverage.

3. CoT Annotation Workflow and Filtering

Annotation involved generating detailed chain-of-thought rationales using a proprietary multimodal model (“Seed1.5-VL”). The pipeline followed a uniform prompt template:

  • System-level prompt: instructs the model to produce a <think> block (step-by-step rationale) followed by an <answer> block (final prediction or structured result).
  • Task-specific template: for QA, "You are given an image/video and a question. First THINK out loud step by step, then ANSWER."
  • Perception tasks: the <answer> block is constrained to a JSON schema for grounding, tracking, and segmentation responses.

The annotation pipeline incorporated multi-stage quality control:

  • Rule-based tasks (QA, grounding, tracking): automatic validation of the answer format, with filtering of invalid or demonstrably incorrect outputs.
  • Open-ended QA and captioning: an external reward model (POLAR-7B) scored each output, and only outputs with a sufficiently high similarity score S_RM were retained (quoted as S_RM ≥ 0.7; thresholds in the 0.7–0.8 range are typical for this kind of filtering).
  • All tasks: the chain-of-thought was required to be at least 20 tokens long for inclusion.

The high-quality subset remaining after filtering is designated OneThinker-SFT-340k, containing 340,000 rigorously CoT-annotated examples. A minimal sketch of the filtering logic appears below.
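
The following sketch illustrates the described quality-control pass under stated assumptions: `reward_model_score` and `answer_is_valid` are hypothetical stand-ins for the POLAR-7B scorer and the rule-based format/correctness checks, the task-name strings are illustrative, and token length is approximated by whitespace splitting.

    # Sketch of the multi-stage quality filtering described above.
    # `reward_model_score` and `answer_is_valid` are hypothetical stand-ins
    # for the POLAR-7B scorer and the rule-based checks; not released APIs.
    MIN_COT_TOKENS = 20   # minimum chain-of-thought length (tokens, approximated)
    RM_THRESHOLD = 0.7    # reward-model similarity threshold for open-ended tasks

    RULE_BASED_TASKS = {
        "rule_based_qa", "spatial_grounding", "temporal_grounding",
        "spatio_temporal_grounding", "tracking",
    }

    def keep_sample(sample, reward_model_score, answer_is_valid):
        """Return True if an annotated sample passes all quality filters."""
        # All tasks: require a sufficiently long chain-of-thought.
        if len(sample["cot"].split()) < MIN_COT_TOKENS:
            return False
        if sample["task"] in RULE_BASED_TASKS:
            # Rule-based tasks: automatic format validation and correctness check.
            return answer_is_valid(sample["answer"], sample.get("ground_truth"))
        # Open-ended QA / captioning: keep only outputs the reward model rates highly.
        score = reward_model_score(sample["prompt"], sample["answer"], sample.get("reference"))
        return score >= RM_THRESHOLD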

4. Dataset Structure, Sampling, and File Formats

OneThinker-600k and its SFT-340k subset are stored as newline-delimited JSON (NDJSON), with each sample represented as an object of the following form:

    {
      "id": "IMG_0001234",
      "modality": "image" | "video",
      "task": "image_qa" | "video_caption" | ...,
      "source": "MMBench" | "RefCOCO" | ...,
      "prompt": { "system": "...", "user": "..." },
      "cot": "First reason this, then that, …",
      "answer": "C" | "A sedan car on a street." | {  },
      "metadata": { "frame_count": 32, "duration_s": 5.2 }
    }
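
A minimal sketch of iterating over such an NDJSON file follows; the file name is a placeholder, and the field names are assumed to match the schema above.

    import json
    from collections import Counter

    # Read OneThinker-style NDJSON: one JSON object per line.
    def load_samples(path="onethinker_sft_340k.ndjson"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    # Example: per-task / per-modality counts across the corpus.
    if __name__ == "__main__":
        counts = Counter((s["task"], s["modality"]) for s in load_samples())
        for (task, modality), n in counts.most_common():
            print(f"{task:28s} {modality:6s} {n}")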

For perception tasks, the answer field contains a structured JSON object:

  • boxes: list of bounding boxes.
  • points_pos / points_neg: annotated points (for segmentation).
  • time_span: start/end times for temporal or spatio-temporal localization.

Sampling strategy:

  • Uniform across the eight tasks, preserving image/video proportions.
  • Stratified within each task to sample easy to hard instances proportionally.
  • No explicit mathematical weighting is employed in SFT subset selection.

A sketch of this sampling procedure is given below.
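
The following is a minimal sketch of uniform per-task sampling with easy-to-hard stratification, assuming each sample carries a `task` field and a hypothetical `difficulty` field; the source does not name these fields or describe the exact procedure.

    import random
    from collections import defaultdict

    # Sketch of uniform-per-task sampling with easy-to-hard stratification.
    # `difficulty` is a hypothetical stand-in for whatever difficulty signal
    # was used; no explicit weighting scheme is described in the source.
    def sample_sft_subset(samples, per_task):
        by_task = defaultdict(list)
        for s in samples:
            by_task[s["task"]].append(s)
        subset = []
        for task, items in by_task.items():
            # Sort by difficulty, then take evenly spaced items so that easy
            # and hard instances are represented proportionally.
            items = sorted(items, key=lambda s: s.get("difficulty", 0.0))
            step = max(1, len(items) // per_task)
            subset.extend(items[::step][:per_task])
        random.shuffle(subset)
        return subset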

5. Corpus Statistics and Annotation Characteristics

Summary statistics:

| Corpus Split              | Total Samples | Images | Videos |
|---------------------------|---------------|--------|--------|
| OneThinker-600k           | 600,000       | 300k   | 300k   |
| OneThinker-SFT-340k (CoT) | 340,000       | 170k   | 170k   |

Annotation length:

  • <think>: approximately 60–80 tokens (8–12 reasoning steps).
  • <answer>: approximately 15–25 tokens for text tasks; variable-length structured JSON for perception tasks.

CoT complexity:

  • QA examples: 5–10 discrete reasoning steps.
  • Grounding/tracking: 4–6 spatial/temporal inference steps.

Representative sample (Image QA):

    {
      "id":"IMG_0421",
      "modality":"image",
      "task":"image_qa",
      "source":"MMMU",
      "prompt":{
        "system":"You are a helpful visual reasoning assistant…",
        "user":"Q: What is the color of the umbrella? A. red B. blue C. green D. yellow"
      },
      "cot":"<think> I see a person holding an umbrella… the canopy is light blue… </think>",
      "answer":"B"
    }
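
The <think>/<answer> convention can be post-processed with simple tag parsing. A minimal sketch follows; the tag names mirror the examples above, and the helper itself is illustrative rather than part of the released tooling.

    import re

    # Extract the reasoning trace and final answer from model output that
    # follows the <think>...</think> / <answer>...</answer> convention.
    # Returns None for a block that is missing.
    def split_cot_output(text):
        think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
        return (
            think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else None,
        )

    # The stored "cot" field keeps only the <think> block, with the answer in
    # a separate field; raw model output would carry both blocks, e.g.:
    raw = "<think> I see a person holding an umbrella… the canopy is light blue… </think><answer>B</answer>"
    print(split_cot_output(raw))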

Representative sample (Video spatio-temporal grounding):

    {
      "id":"VID_1234",
      "modality":"video",
      "task":"spatio_temporal_grounding",
      "source":"STVG",
      "prompt":{"system":"", "user":"Locate when and where the dog jumps over the fence."},
      "cot":"<think> At 00:02 the dog crouches, by 00:03 it leaps… </think>",
      "answer":{
        "time_span":[2.0,3.5],
        "boxes":[ [45,60,120,200], [48,62,123,205],  ]
      }
    }

6. Evaluation Metrics and Formalisms

Annotation and downstream training employ formal reward and metric functions:

  • Overall RL reward per rollout: $R = R_{acc} + R_{format}$
  • Open-ended reward via a reward model: $R_{acc} = RM(q, \hat{y}, a)$
  • Temporal Intersection-over-Union: $R_{acc} = \text{tIoU}([\hat{s}, \hat{e}], [s, e])$
  • Spatial IoU: $R_{acc} = \text{sIoU}(\hat{b}, b)$
  • Segmentation reward (image): $R_{acc} = \text{sIoU}(\hat{b}, b) + G(d_{pos}) + G(d_{neg})$, with $G(d) = \exp\!\left(-\frac{d^2}{2\sigma^2}\right)$ and $\sigma = 50$ px.
  • EMA-GRPO normalization: $m_1^\tau(t) = \beta\, m_1^\tau(t-1) + (1-\beta)\, \mu^\tau(t)$, $m_2^\tau(t) = \beta\, m_2^\tau(t-1) + (1-\beta)\, \nu^\tau(t)$, $\sigma^\tau(t) = \sqrt{m_2^\tau - (m_1^\tau)^2}$.

These metrics serve both as scoring functions in the annotation-filtering stage (via reward-model scoring) and as objective targets for RL optimization. A sketch of the IoU-based rewards and the EMA update follows.
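
The sketch below implements the formulas above directly; the function names are illustrative, the box/span formats follow the dataset examples, and the EMA decay value is an assumption since the source does not state it here.

    import math

    SIGMA_PX = 50.0  # Gaussian bandwidth for point-distance rewards (sigma = 50 px)

    def tiou(pred_span, gt_span):
        """Temporal IoU between [s_hat, e_hat] and [s, e]."""
        (ps, pe), (gs, ge) = pred_span, gt_span
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        return inter / union if union > 0 else 0.0

    def siou(pred_box, gt_box):
        """Spatial IoU between boxes given as [x1, y1, x2, y2]."""
        ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
        ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
        union = area(pred_box) + area(gt_box) - inter
        return inter / union if union > 0 else 0.0

    def gaussian_point_term(d, sigma=SIGMA_PX):
        """G(d) = exp(-d^2 / (2 sigma^2)) for a point-to-target distance d."""
        return math.exp(-(d ** 2) / (2 * sigma ** 2))

    def segmentation_reward(pred_box, gt_box, d_pos, d_neg):
        """R_acc = sIoU(b_hat, b) + G(d_pos) + G(d_neg)."""
        return siou(pred_box, gt_box) + gaussian_point_term(d_pos) + gaussian_point_term(d_neg)

    def ema_grpo_update(m1, m2, mean_r, second_moment_r, beta=0.9):
        """EMA-GRPO running statistics for one task tau:
        m1 <- beta*m1 + (1-beta)*mu, m2 <- beta*m2 + (1-beta)*nu,
        sigma = sqrt(m2 - m1^2). beta=0.9 is an assumed value."""
        m1 = beta * m1 + (1 - beta) * mean_r
        m2 = beta * m2 + (1 - beta) * second_moment_r
        sigma = math.sqrt(max(0.0, m2 - m1 ** 2))
        return m1, m2, sigma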

7. Significance and Applications

The OneThinker-600k corpus is foundational for training and evaluating unified multimodal reasoning models capable of transferring knowledge across disparate visual and temporal tasks. Its step-by-step reasoning traces and broad task coverage enable benchmarking and analysis of generalist models on 31 visual understanding benchmarks spanning 10 core tasks (Feng et al., 2 Dec 2025). The structured, schema-compliant data aligns with recent advances in automatic evaluation and RL-based fine-tuning for MLLMs, and the publicly released code, model, and data support transparency and reproducibility.

Beyond standard prompt/template illustrations, the main text provides no schematic diagrams of the data-construction process, and no statistics or workflow thresholds beyond those noted above.
References (1)
