
OneThinker-600k Dataset Overview

Updated 9 December 2025
  • The OneThinker-600k dataset is a comprehensive multimodal resource that supports unified visual reasoning across both image and video modalities.
  • It employs a rigorous annotation pipeline using chain-of-thought generation to achieve high-fidelity and balanced coverage of eight fundamental visual tasks.
  • The dataset facilitates both supervised fine-tuning and reinforcement learning with reward normalization to stabilize diverse task rewards.

The OneThinker-600k dataset is a large-scale multimodal corpus curated to support the development and training of unified visual reasoning models that operate across both image and video modalities. Designed as the foundational resource for the OneThinker architecture, OneThinker-600k encompasses a balanced suite of fundamental visual tasks, enabling joint learning and evaluation in domains that historically required separate models or datasets. The corpus is distinguished by its comprehensive task coverage, rigorous annotation pipeline, and deliberate attention to reward normalization and distributional balance, facilitating effective supervised and reinforcement learning paradigms for multimodal LLMs (MLLMs) (Feng et al., 2 Dec 2025).

1. Dataset Composition and Modalities

OneThinker-600k comprises approximately 600,000 multimodal examples, with a deliberate 1:1 proportional split between images and video clips. The dataset spans eight fundamental visual understanding tasks, sampled to achieve near-uniform coverage of both task types and modalities. A subset, OneThinker-SFT-340k, consisting of ~340,000 chain-of-thought (CoT) annotated examples, is designated for supervised fine-tuning (SFT) cold start and represents over 56% of the total pool after quality filtering.

| Task | Modality | Approx. # samples |
| --- | --- | --- |
| Rule-based QA | Image + Video | ~75,000 |
| Open-ended QA & Captioning | Image + Video | ~75,000 |
| Spatial Grounding | Image | ~75,000 |
| Temporal Grounding | Video | ~75,000 |
| Spatio-Temporal Grounding | Video | ~75,000 |
| Tracking | Video | ~75,000 |
| Image Segmentation | Image | ~75,000 |
| Video Segmentation | Video | ~75,000 |

This extensive and diverse task breakdown establishes OneThinker-600k as one of the most integrated multimodal reasoning corpora for both supervised and reinforcement learning research.

2. Data Sources and Preprocessing

All data in OneThinker-600k are sourced from established public benchmarks and datasets. For instance, QA examples are drawn from MMMU, MathVista, and ScienceQA; grounding uses RefCOCO-series data; tracking leverages GOT-10k; segmentation is sourced from MeViS and ReasonVOS; video QA is supported by VideoMMMU and VideoMME benchmarks. No proprietary or web-scraped image or video data are included.

Preprocessing steps involve:

  • Image inputs are resized to the resolution expected by the model’s vision backbone (224×224 or 336×336).
  • Video clips are uniformly sampled up to 128 frames; longer clips are truncated and shorter clips are upsampled or padded.
  • All inputs are normalized using ImageNet mean and standard deviation prior to encoder processing.
  • During training, each mini-batch contains an equal number of image and video samples.

These measures ensure homogeneity in the input structure, minimizing modality-induced bias and enabling effective joint optimization.
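
To make these steps concrete, here is a minimal sketch of the frame-sampling and normalization logic, assuming NumPy arrays and float images in [0, 1]; the function names and the exact upsampling rule are illustrative rather than taken from the released code.

```python
import numpy as np

# Standard ImageNet normalization statistics (RGB).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def sample_frame_indices(num_frames: int, cap: int = 128) -> np.ndarray:
    """Uniformly sample frame indices up to the cap.

    Longer clips are subsampled down to `cap` frames; shorter clips end
    up with repeated indices, one plausible reading of "upsampled".
    """
    return np.linspace(0, num_frames - 1, cap).round().astype(int)

def normalize(image: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) float image in [0, 1] with ImageNet stats."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```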

3. Annotation Pipeline

The annotation of OneThinker-600k is fully model-driven, utilizing a proprietary Seed1.5-VL model for chain-of-thought generation under a “think-then-answer” prompting strategy. Each example is processed using a system-level template that segments the chain-of-thought from the answer:

<think> ... chain-of-thought ... </think>
<answer> ... task-specific JSON output ... </answer>

Task-specific prompts define the required answer schema (an illustrative answer payload follows the list), including:

  • QA: Multiple-choice verification, numeric tolerance enforcement, and similarity scoring using POLAR-7B for open-ended responses.
  • Grounding/Tracking: JSON schemas specifying bounding box or timestamp pairs.
  • Segmentation: JSON objects encoding bounding boxes and sets of positive/negative points.
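
As a concrete illustration of the segmentation schema, a well-formed <answer> payload might look like the following; the field names here are assumptions for illustration, and the released dataset defines the exact keys.

```python
import json

# Hypothetical segmentation answer: a bounding box plus sets of
# positive/negative points, per the schema description above.
example_answer = {
    "bbox": [120, 45, 360, 290],                  # [x1, y1, x2, y2] in pixels
    "positive_points": [[200, 150], [240, 180]],  # points on the target object
    "negative_points": [[40, 60]],                # points on the background
}

# Serialized exactly as it would appear inside <answer> ... </answer>.
print(json.dumps(example_answer))
```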

Quality control is administered via rule-based format checks and task-specific accuracy thresholds (e.g., ≥90% match to ground-truth for multiple-choice QA annotation), resulting in post-filtering retention of 340,000 high-fidelity CoT examples and an empirical annotation error rate below 3%. Single-model annotation removes inter-annotator variance considerations.
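
A rule-based format check of this kind can be as simple as verifying the think/answer structure and confirming that the answer segment parses as JSON. The sketch below assumes that minimal rule set; the actual filters additionally apply the task-specific accuracy thresholds described above.

```python
import json
import re

RESPONSE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def passes_format_check(response: str) -> bool:
    """Accept only responses with well-formed think/answer blocks whose
    answer segment parses as JSON (for structured-output tasks)."""
    match = RESPONSE_RE.search(response)
    if match is None:
        return False
    try:
        json.loads(match.group(2))
    except json.JSONDecodeError:
        return False
    return True
```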

4. Task Balancing and Reward Normalization

Sampling in both SFT and RL ensures each mini-batch contains an equal division of image and video data, and task types are rotated or sampled uniformly so that no additional weighting is necessary. This balanced round-robin sampling prevents task or modality dominance in either phase; a minimal sampler is sketched below.
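
The sketch assumes example pools keyed by task name and split by modality; the pool layout and the 50/50 draw are illustrative, not the released implementation.

```python
import random

def balanced_batch(image_pools: dict, video_pools: dict, batch_size: int) -> list:
    """Build a mini-batch with an equal image/video split, cycling
    round-robin over task pools so no single task dominates."""
    assert batch_size % 2 == 0, "need an even split between modalities"
    batch = []
    for pools in (image_pools, video_pools):
        tasks = sorted(pools)
        for i in range(batch_size // 2):
            task = tasks[i % len(tasks)]          # round-robin over tasks
            batch.append(random.choice(pools[task]))
    random.shuffle(batch)
    return batch
```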

For reinforcement learning, reward heterogeneity is mitigated by EMA-GRPO, a variant of Group Relative Policy Optimization (GRPO) that maintains exponential moving averages of task-specific reward statistics. Group means and variances are tracked per task at each RL step:

  • For task $\tau$ at step $t$, with group rewards $\{R_i\}$:
    • $\mu^\tau(t) = \mathrm{mean}(\{R_i\})$
    • $\nu^\tau(t) = \mathrm{mean}(\{R_i^2\})$
    • $m_1^\tau(t) = \beta \, m_1^\tau(t-1) + (1-\beta) \, \mu^\tau(t)$
    • $m_2^\tau(t) = \beta \, m_2^\tau(t-1) + (1-\beta) \, \nu^\tau(t)$
    • $\sigma^\tau(t) = \sqrt{m_2^\tau(t) - [m_1^\tau(t)]^2}$
    • $A_i^\tau(t) = \dfrac{R_i - \mathrm{mean}(\{R_j\})}{\sigma^\tau(t)}$, clipped to $[-5, +5]$

The RL objective incorporates these normalized advantages and a KL penalty:

$$\mathbb{E}_{q,\{o_i\}}\left[\frac{1}{G}\sum_i \min\big(r_i A_i,\ \mathrm{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) - \beta_{KL}\, D_{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right]$$

This normalization ensures stable policy updates across diverse tasks by addressing the variance in reward magnitude and dispersion.
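
The update rules translate almost directly into code. The sketch below maintains per-task EMA statistics and returns clipped advantages; the decay value β = 0.9 and the small variance floor are assumptions for illustration, not values reported for OneThinker.

```python
import numpy as np

class EMAGroupStats:
    """Per-task EMA of the reward mean and second moment, used for
    EMA-GRPO-style advantage normalization (see the update rules above)."""

    def __init__(self, beta: float = 0.9, clip: float = 5.0):
        self.beta, self.clip = beta, clip
        self.m1 = {}  # EMA of mean reward, m1^tau
        self.m2 = {}  # EMA of mean squared reward, m2^tau

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        mu, nu = rewards.mean(), (rewards ** 2).mean()
        # EMA update; the first group for a task seeds its statistics.
        self.m1[task] = self.beta * self.m1.get(task, mu) + (1 - self.beta) * mu
        self.m2[task] = self.beta * self.m2.get(task, nu) + (1 - self.beta) * nu
        sigma = np.sqrt(max(self.m2[task] - self.m1[task] ** 2, 1e-8))
        # Center by the current group mean, scale by the EMA std, then clip.
        return np.clip((rewards - mu) / sigma, -self.clip, self.clip)
```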

5. Practical Usage and Implementation Protocol

Recommended usage of OneThinker-600k follows a two-stage process (the stated hyperparameters are collected into a config sketch after the list):

  • Stage 1 (SFT cold start): load the Qwen-3-VL-Instruct-8B checkpoint and the OneThinker-SFT-340k subset; train with batch size 32, learning rate 1e-5, the AdamW optimizer, and a response length cap of 4096 tokens.
  • Stage 2 (RL fine-tuning): initialize from the SFT model and use the full OneThinker-600k dataset; train with batch size 128, learning rate 2e-6, group size G = 8, β_KL = 0.01, and a video frame cap of 128.
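
For reference, the hyperparameters above gathered into plain config dicts; the key names are illustrative, not those of the official training scripts.

```python
SFT_CONFIG = {
    "init_checkpoint": "Qwen-3-VL-Instruct-8B",
    "data": "OneThinker-SFT-340k",
    "batch_size": 32,
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "max_response_tokens": 4096,
}

RL_CONFIG = {
    "init_checkpoint": "<stage-1 SFT model>",
    "data": "OneThinker-600k",
    "batch_size": 128,
    "learning_rate": 2e-6,
    "group_size": 8,          # G
    "kl_coefficient": 0.01,   # beta_KL
    "max_video_frames": 128,
}
```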

The input format requires prepending the system prompt, then the task prompt, and finally the modality-specific data (image tokens or video frame tokens). Model outputs adhere to the structured XML-style blocks <think>...</think> and <answer>...</answer>. For segmentation tasks, the JSON contained in <answer> is post-processed with SAM2 to synthesize the final mask.
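
A small sketch of the segmentation hand-off, assuming the illustrative answer schema from Section 3; the actual SAM2 call is omitted since it depends on the released pipeline.

```python
import json
import re

def extract_sam2_prompts(response: str):
    """Pull the JSON out of <answer> and split it into the box and point
    prompts that SAM2-style mask synthesis would consume."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found")
    obj = json.loads(match.group(1))
    return obj["bbox"], obj["positive_points"], obj["negative_points"]
```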

6. Availability, Licensing, and Restrictions

The OneThinker-600k dataset, associated code, and trained models are publicly available via GitHub and HuggingFace repositories. Distribution and usage are governed by the Apache-2.0 license, with an additional non-commercial, research-only clause mandating citation of the OneThinker paper. Downstream redistribution under the same license is permitted, and users may train with proprietary data without the obligation to release derived datasets.

7. Context and Significance for Multimodal Reasoning Research

By synthesizing diverse task domains within a single, consistently annotated corpus spanning both image and video modalities, OneThinker-600k offers a basis for training and evaluating all-in-one reasoning models. Its design facilitates robust multi-task learning, effective cross-task and cross-modal transfer, and scalable generalization, a response to the prior fragmentation of dataset and model resources for multimodal reasoning. Its explicit treatment of reward normalization and detailed attention to annotation fidelity establish the corpus as a resource for rigorous empirical study and benchmarking in contemporary MLLM development (Feng et al., 2 Dec 2025).

References

Feng et al., 2 Dec 2025 (the OneThinker paper).
