OneThinker-600k Multimodal Dataset
- OneThinker-600k is a large-scale multimodal dataset comprising 600,000 annotated samples from both images and videos for visual reasoning tasks.
- It covers eight core tasks such as rule-based and open-ended QA, captioning, grounding, tracking, and segmentation to support comprehensive evaluation.
- The dataset employs a systematic chain-of-thought annotation pipeline with rigorous quality filtering to boost the performance of unified multimodal models.
OneThinker-600k, constructed for the unified multimodal visual reasoning model OneThinker, is a large-scale dataset comprising 600,000 annotated samples designed to enable joint training across image and video tasks. It was developed to support generalist multimodal LLMs (MLLMs) for eight fundamental types of visual reasoning, with a systematic chain-of-thought (CoT) annotation pipeline. The corpus underpins the supervised fine-tuning of the OneThinker reasoning model and advances research toward unified, task- and modality-agnostic multimodal understanding (Feng et al., 2 Dec 2025).
## 1. Corpus Composition and Task Coverage
The OneThinker-600k corpus contains 600,000 multimodal samples, split approximately evenly between static images and short video clips.
Task Coverage:
- Rule-based Question Answering (QA): Multiple-choice, numerical, regression, math, OCR.
- Open-ended QA: General and scientific queries.
- Captioning: Free-form descriptions for both images and videos.
- Spatial Grounding: Identifying box coordinates or point locations in images.
- Temporal Grounding: Locating relevant time intervals in videos.
- Spatio-temporal Grounding: Simultaneous spatial and temporal localization.
- Tracking: Identification of object trajectory across frames.
- Segmentation: Region-based annotation for both image and video modalities.
Per-task instance counts are not reported in tabular form; only the 600,000-sample total is given. A plausible implication is that each of the eight fundamental tasks contributes on the order of 50,000–100,000 samples, though this split is not stated explicitly.
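As an illustration of how the eight task families, their modalities, and their answer formats fit together, the sketch below organizes them as a small registry. The task identifiers are hypothetical (chosen to resemble the NDJSON `task` field shown in Section 4), not names fixed by the source.

```python
# Illustrative task registry: eight task families, their modalities, and the
# kind of answer each produces. Identifiers and groupings are assumptions.
TASKS = {
    "rule_based_qa":             {"modality": ("image", "video"), "answer": "option / number / text"},
    "open_ended_qa":             {"modality": ("image", "video"), "answer": "free-form text"},
    "captioning":                {"modality": ("image", "video"), "answer": "free-form text"},
    "spatial_grounding":         {"modality": ("image",),         "answer": "boxes or points"},
    "temporal_grounding":        {"modality": ("video",),         "answer": "time_span"},
    "spatio_temporal_grounding": {"modality": ("video",),         "answer": "time_span + boxes"},
    "tracking":                  {"modality": ("video",),         "answer": "per-frame boxes"},
    "segmentation":              {"modality": ("image", "video"), "answer": "masks / points"},
}
```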
## 2. Data Sources and Provenance
Samples in OneThinker-600k are selected from a diverse set of benchmarks, each addressing distinct facets of visual reasoning:
| Task | Data Sources (Original Benchmarks) |
|---|---|
| Rule-based QA | MMMU, MathVista, MathVerse, MMBench, MMStar, ScienceQA, AI2D, MMT-Bench |
| Open-ended QA | ScienceQA and similar open-ended QA sets |
| Image Captioning | MMSci-Caption, MMT-Caption |
| Video Captioning | VideoMMLU-Caption |
| Spatial Grounding | RefCOCO, RefCOCO+, RefCOCOg |
| Temporal Grounding | Charades (TALL), ActivityNet (dense-caption events), ANet-RTL |
| Spatio-temporal Grounding | STVG |
| Tracking | GOT-10k |
| Segmentation | RefCOCO series, MeViS, ReasonVOS; region masks via SAM2 |
Integration across these sources enables the dataset to support multitask, multimodal reasoning research with broad empirical coverage.
## 3. CoT Annotation Workflow and Filtering
Annotation involved generating detailed chain-of-thought rationales using a proprietary multimodal model (“Seed1.5-VL”). The pipeline followed a uniform prompt template:
- System-level Prompt: Instructs the model to produce `<think>` (step-by-step rationale) and `<answer>` (final prediction or structured result) blocks.
- Task-specific Template: For QA: "You are given an image/video and a question. First THINK out loud step by step, then ANSWER."
- Perception Tasks: The `<answer>` block is constrained to a JSON schema for grounding, tracking, and segmentation responses.

The annotation pipeline incorporated multi-stage quality control:

- Rule-based tasks (QA, grounding, tracking): automatic validation of answer format, with filtering of invalid or demonstrably incorrect outputs.
- Open-ended QA/Captioning: an external reward model (POLAR-7B) scored each output; only outputs with a similarity score S_RM ≥ 0.7 (a typical threshold for this kind of filtering) were retained.
- All tasks: a chain-of-thought length of at least 20 tokens was required for inclusion.

The resulting high-quality subset after filtering is designated OneThinker-SFT-340k, containing 340,000 rigorously CoT-annotated examples.
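The filtering logic described above can be summarized in a short sketch. This is an illustrative reconstruction rather than the released pipeline: the field names, the whitespace-based token count, and the `is_valid_format` helper are assumptions.

```python
MIN_COT_TOKENS = 20        # minimum chain-of-thought length (approximated here by word count)
REWARD_THRESHOLD = 0.7     # POLAR-7B similarity-score cutoff for open-ended tasks

RULE_BASED_TASKS = {"image_qa", "video_qa", "spatial_grounding",
                    "temporal_grounding", "spatio_temporal_grounding", "tracking"}

def is_valid_format(answer, task):
    """Hypothetical format check: perception answers must be structured JSON, text answers non-empty."""
    if task in {"spatial_grounding", "tracking", "spatio_temporal_grounding"}:
        return isinstance(answer, dict)
    return isinstance(answer, str) and len(answer) > 0

def keep_sample(sample, reward_score=None):
    """Return True if an annotated sample passes the quality filters sketched above."""
    cot = sample.get("cot", "")
    # All tasks: require a sufficiently long reasoning trace.
    if len(cot.split()) < MIN_COT_TOKENS:
        return False
    if sample["task"] in RULE_BASED_TASKS:
        # Rule-based tasks: the answer must parse into the expected format.
        return is_valid_format(sample["answer"], sample["task"])
    # Open-ended QA / captioning: keep only outputs scored highly by the reward model.
    return reward_score is not None and reward_score >= REWARD_THRESHOLD
```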
## 4. Dataset Structure, Sampling, and File Formats

OneThinker-600k and its SFT-340k subset are stored as newline-delimited JSON (NDJSON), with each sample represented as an object of the form:

```json
{
  "id": "IMG_0001234",
  "modality": "image" | "video",
  "task": "image_qa" | "video_caption" | ...,
  "source": "MMBench" | "RefCOCO" | ...,
  "prompt": { "system": "...", "user": "..." },
  "cot": "First reason this, then that, …",
  "answer": "C" | "A sedan car on a street." | { … },
  "metadata": { "frame_count": 32, "duration_s": 5.2 }
}
```

For perception tasks, the `answer` field contains structured JSON:

- `boxes`: list of bounding boxes.
- `points_pos` / `points_neg`: annotated points (for segmentation).
- `time_span`: start/end times for video or spatio-temporal localization.

Sampling Strategy:

- Uniform across the eight tasks, preserving image/video proportions.
- Stratified within each task to sample easy-to-hard instances proportionally.
- No explicit mathematical weighting is employed in SFT subset selection.
## 5. Corpus Statistics and Annotation Characteristics

Summary statistics:

| Corpus Split | Total Samples | Images | Videos |
|---|---|---|---|
| OneThinker-600k | 600,000 | 300k | 300k |
| OneThinker-SFT-340k (CoT) | 340,000 | 170k | 170k |

Annotation length:

- `<think>`: ~60–80 tokens (8–12 reasoning steps).
- `<answer>`: ~15–25 tokens for text tasks; variable-length structured JSON for perception tasks.

CoT complexity:

- QA examples: 5–10 discrete steps.
- Grounding/tracking: 4–6 spatial/temporal inference steps.

Representative sample (Image QA):

```json
{
  "id": "IMG_0421",
  "modality": "image",
  "task": "image_qa",
  "source": "MMMU",
  "prompt": {
    "system": "You are a helpful visual reasoning assistant…",
    "user": "Q: What is the color of the umbrella? A. red B. blue C. green D. yellow"
  },
  "cot": "<think> I see a person holding an umbrella… the canopy is light blue… </think>",
  "answer": "B"
}
```

Representative sample (Video spatio-temporal grounding):
```json
{
  "id": "VID_1234",
  "modality": "video",
  "task": "spatio_temporal_grounding",
  "source": "STVG",
  "prompt": {
    "system": "…",
    "user": "Locate when and where the dog jumps over the fence."
  },
  "cot": "<think> At 00:02 the dog crouches, by 00:03 it leaps… </think>",
  "answer": {
    "time_span": [2.0, 3.5],
    "boxes": [ [45, 60, 120, 200], [48, 62, 123, 205], … ]
  }
}
```

## 6. Evaluation Metrics and Formalisms

Annotation and downstream training employed formal reward and metric functions:

- Overall RL reward per rollout.
- Open-ended reward scored by a language-model judge.
- Temporal Intersection-over-Union (IoU) between a predicted interval $[s_p, e_p]$ and a ground-truth interval $[s_g, e_g]$: $\mathrm{tIoU} = \frac{\max(0,\ \min(e_p, e_g) - \max(s_p, s_g))}{\max(e_p, e_g) - \min(s_p, s_g)}$.
- Spatial IoU between predicted and ground-truth boxes or masks: $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
- Segmentation reward (image), computed from mask overlap subject to a pixel-level tolerance.
- EMA-GRPO normalization, using exponential-moving-average statistics to normalize rewards during group-relative policy optimization.

These metrics serve both in reward modeling for annotation filtering (via LM scoring) and as objective targets for RL optimization.

## 7. Significance and Applications

The OneThinker-600k corpus is foundational for training and evaluating unified multimodal reasoning models capable of fluently transferring knowledge across disparate visual and temporal tasks. Its step-by-step reasoning traces and broad task coverage enable benchmarking and analysis of generalist models on 31 visual understanding benchmarks spanning 10 core tasks (Feng et al., 2 Dec 2025). The structured, schema-compliant data aligns with recent advances in automatic evaluation and RL-based fine-tuning for MLLMs, and the publicly released code, model, and data support transparency and reproducibility in the research community.

No schematic diagrams for data construction beyond standard prompt/template illustrations appear in the main text, and no further statistics or workflow thresholds are provided beyond those noted above.