Monet-SFT-125K: Multimodal CoT Dataset

Updated 28 November 2025
  • Monet-SFT-125K is a large-scale multimodal chain-of-thought dataset containing 125K rigorously curated image–text–CoT triples for supervising latent visual reasoning.
  • It employs a three-stage filtration and annotation pipeline to ensure necessity, correctness, and token-level alignment for robust model training.
  • The dataset underpins visual-latent policy optimization and supports diverse applications including chart reading, geometry, OCR, and 3D object counting.

Monet-SFT-125K is a large-scale, high-fidelity, multimodal chain-of-thought (CoT) dataset designed to provide supervision for supervised fine-tuning (SFT) and reinforcement learning pipelines targeting latent visual reasoning in multimodal LLMs (MLLMs). It consists of 125,000 rigorously curated image–text–CoT triples spanning abstract chart reading, real-world scene understanding, optical character recognition (OCR), geometry problems, and 3D object-counting tasks. Developed as part of the Monet framework, the dataset underpins explicit latent-embedding alignment and policy-optimization objectives for inductive reasoning well beyond conventional image–language modeling (Wang et al., 26 Nov 2025).

1. Composition and Problem Domains

Monet-SFT-125K aggregates data from six precursor sources, segmented into four key application domains: chart-only (ReFocus), geometry/sketch (Zebra-CoT geometry), 3D counting (Zebra-CoT count), and mixed real-world/OCR/chart reasoning (CogCoM, Visual-CoT, Zebra-CoT search).

The per-source breakdown is summarized as follows:

| Source | Problem Domain | Count (K) |
|---|---|---|
| ReFocus | Chart reading | 0.4 |
| CogCoM | Real-world / Charts | 0.5 |
| Visual-CoT | Real-world / Documents (OCR) / Charts | 118.6 |
| Zebra-CoT (search) | Real-world / Documents / Charts | 2.7 |
| Zebra-CoT (geometry) | Geometry problems | 0.1 |
| Zebra-CoT (count) | 3D object counting | 2.9 |
| Total | | 125 |

Chart-only and geometry tasks offer tightly constrained reasoning, while mixed domains introduce increased complexity and diversity. This grouping supports robust model transfer in both in-distribution and out-of-distribution visual reasoning.

2. Filtration and Annotation Pipeline

Monet-SFT-125K was constructed using a three-stage pipeline:

Stage 1: Necessity Filtering

All candidate CoT samples were sourced from precursor datasets. The base MLLM, Qwen2.5-VL-7B, was executed on each original question–image pair. Samples answerable without auxiliary images were filtered out, ensuring all retained tasks necessitate multi-step visual inference.

Stage 2: Correctness Filtering

For candidates passing Stage 1, the same model was prompted with only the chain-of-thought's auxiliary image segments (excluding the main question image). Examples were retained only if correct answers were produced, verifying each auxiliary visual's sufficiency and utility.
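
A minimal sketch of these two filters is given below; the helper `run_mllm`, the field names, and the exact-match answer check are assumptions for illustration, not part of the released pipeline.

```python
# Hypothetical sketch of Stage 1 (necessity) and Stage 2 (correctness) filtering.
# `run_mllm` stands in for querying the base MLLM (Qwen2.5-VL-7B); replace it with
# your own inference call (e.g., a transformers/vLLM generate wrapper).
def run_mllm(images: list, question: str) -> str:
    """Return the model's answer string given a list of images and a question."""
    raise NotImplementedError

def passes_necessity(sample: dict) -> bool:
    # Stage 1: drop samples the base model already answers from the question image
    # alone, so every retained task genuinely requires auxiliary visual reasoning.
    pred = run_mllm([sample["question_image"]], sample["question"])
    return pred.strip() != sample["answer"].strip()

def passes_correctness(sample: dict) -> bool:
    # Stage 2: keep a sample only if the CoT's auxiliary images (without the
    # question image) are sufficient for the model to produce the correct answer.
    pred = run_mllm(sample["auxiliary_images"], sample["question"])
    return pred.strip() == sample["answer"].strip()

candidates: list[dict] = []  # precursor CoT samples would be loaded here
filtered = [s for s in candidates if passes_necessity(s) and passes_correctness(s)]
```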

Stage 3: Token-Level Supervision

Retained samples underwent dual multimodal judgment (DeepSeek-V3.1, Gemini 2.5 Pro), with both annotators marking indispensable text tokens generated from direct visual observation. Marked tokens are wrapped with <observation>…</observation> tags, serving as hard semantic anchors for latent-alignment supervision.
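
For illustration, a toy example of the tag format and one way the anchored spans might be extracted (the CoT text below is invented, not drawn from the dataset):

```python
import re

# Invented CoT fragment in the <observation>-tagged format described above.
cot = (
    "Zooming into the cropped chart region, "
    "<observation>the 2019 bar peaks at 42 units</observation>, "
    "so the final answer is \\boxed{42}."
)

# Pull out the hard semantic anchors used for latent-alignment supervision.
observations = re.findall(r"<observation>(.*?)</observation>", cot, flags=re.DOTALL)
print(observations)  # ['the 2019 bar peaks at 42 units']
```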

Quality-control procedures enforced both necessity (Stage 1) and sufficiency (Stage 2) of each reasoning chain, eliminating trivial and noisy cases and giving rise to a dataset uniquely suited for latent observation alignment.

3. Data Structure and Representational Format

Each Monet-SFT-125K sample consists of a serialized interleaving of question text, image embeddings, auxiliary visual embeddings, and tagged text tokens, strictly in the format:

[Question Text]
[Question Image Embeddings]
<STEP-1 Auxiliary Visual Embedding>
<observation>…</observation> (text tokens)
…
\boxed{Answer}
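
A minimal sketch of how one such sample might be laid out before embedding, assuming illustrative field names (this is not the released schema):

```python
# Hypothetical in-memory record mirroring the serialization format above.
sample = {
    "question": "How many bars exceed 40 units?",      # [Question Text]
    "question_image": "chart_0001.png",                 # [Question Image Embeddings]
    "steps": [                                           # alternating visual/text steps
        {
            "auxiliary_image": "chart_0001_crop.png",    # <STEP-k Auxiliary Visual Embedding>
            "text": "<observation>three bars rise above the 40 gridline</observation>",
        },
        # ... further steps ...
    ],
    "answer": "\\boxed{3}",
}
```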

Images are uniformly resized to 224×224 pixels and patch-embedded via Qwen2.5-VL’s vision encoder (standard grid sizes: 14×14, 28×28; patch embedding dimension typically 768). Thus, each visual segment comprises a token sequence $V \in \mathbb{R}^{M \times d}$, where $M = P^2$ for a $P \times P$ patch grid and $d$ is the embedding dimension.

Latent embedding segments are introduced during SFT: whenever the <latent> token is generated by the decoder, the corresponding hidden state $h^{(t)} \in \mathbb{R}^d$ is recycled as a latent embedding, repeated for a fixed length $K$. At inference, these continuous embeddings stand in for actual visual clues.
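
A conceptual PyTorch sketch of this recycling step, assuming a decoder hidden size of 3584 and $K = 8$; it mirrors the description above rather than the Monet implementation:

```python
import torch

def recycle_latent(hidden_state: torch.Tensor, K: int = 8) -> torch.Tensor:
    """Repeat the decoder hidden state at the <latent> position K times,
    producing a (K, d) segment of continuous latent embeddings."""
    return hidden_state.unsqueeze(0).repeat(K, 1)

h_t = torch.randn(3584)        # hidden state at the <latent> token (d = 3584 assumed)
latent_segment = recycle_latent(h_t, K=8)
print(latent_segment.shape)    # torch.Size([8, 3584])
```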

Supervision leverages three loss terms:

  • Observation-alignment (Stage 2):

$$\mathcal{L}_{\mathrm{align\text{-}obs}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{\ell=1}^{L} \left( 1 - \cos\!\left(h^{*\,(i,\ell)}_{\mathrm{obs}},\, \hat h^{(i,\ell)}_{\mathrm{obs}}\right) \right)$$

Gradients propagate exclusively through generated latent embeddings.

  • Latent-alignment (Stage 3):

$$\mathcal{L}_{\mathrm{align\text{-}latent}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{\ell=1}^{L} \left( 1 - \cos\!\left(h^{*\,(i,\ell)}_{\mathrm{latent}},\, \hat h^{(i,\ell)}_{\mathrm{latent}}\right) \right)$$

  • Next-token prediction:

$$\mathcal{L}_{\mathrm{NTP}} = -\frac{1}{N}\sum_{i}\sum_{t} \log p\!\left( y_t \mid y_{<t},\ \text{past images and latents} \right)$$
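
As a rough PyTorch sketch (the batching and tensor shapes are assumptions, not the Monet code), both alignment terms reduce to the same mean-over-samples, sum-over-positions cosine form:

```python
import torch
import torch.nn.functional as F

def alignment_loss(h_target: torch.Tensor, h_pred: torch.Tensor) -> torch.Tensor:
    """Cosine alignment loss over hidden states of shape (N, L, d).

    Targets are detached so gradients flow only through the generated states,
    matching the note that gradients propagate exclusively through the latents."""
    cos = F.cosine_similarity(h_pred, h_target.detach(), dim=-1)   # (N, L)
    return (1.0 - cos).sum(dim=1).mean()                           # sum over L, mean over N

# The NTP term is ordinary cross-entropy over text tokens conditioned on past
# images and latent segments, e.g. F.cross_entropy(logits.transpose(1, 2), labels).
```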

4. Statistical Characteristics

All images are standardized to 224×224 resolution prior to embedding. Each sample averages approximately 10 reasoning steps (alternations of text and auxiliary images); the median is 8, with a variance of roughly 4. Given per-sample step counts $n_i$, these statistics follow

$$\mu = \frac{1}{N} \sum_i n_i, \qquad \sigma^2 = \frac{1}{N} \sum_i \left( n_i - \mu \right)^2$$

Latent reasoning length $K$ is evaluated at values {8, 10, 12, 16}, with single-stage SFT optimal at $K = 8$ and full RL+VLPO at $K = 10$.

Complexity ranges from single lookups and chart crops to multi-step geometric sketch construction and 3D object removal, supporting generalization to both abstract and real-world tasks.

5. Intended Applications and Empirical Benchmarks

Monet-SFT-125K is designed to enable high-fidelity supervision of latent embedding generation and alignment during SFT. It directly supports downstream Visual-latent Policy Optimization (VLPO), a reinforcement learning algorithm that performs policy-gradient updates through latent visual embeddings.

Performance of Monet-7B, trained using Monet-SFT-125K and VLPO, is reported on standard multimodal reasoning benchmarks:

  • V*: Fine-grained visual search
  • HRBench4K/HRBench8K: Chart and spatial reasoning
  • MME-RealWorld-Lite: Mixed perception + reasoning
  • VisualPuzzles: Out-of-distribution abstract logic

The accuracy results for key baselines are tabulated below:

| Model | V* Accuracy | VisualPuzzles Accuracy |
|---|---|---|
| Qwen2.5-VL-7B | 76.44% | 32.71% |
| + vanilla SFT | 81.68% | 33.99% |
| + SFT+GRPO | 78.53% | 30.99% |
| DeepEyes | 83.25% | 32.96% |
| Monet-7B (SFT+VLPO) | 83.25% | 35.02% |

The consistent gains for Monet-7B on both in-distribution perception and abstract out-of-distribution reasoning tasks suggest Monet-SFT-125K’s approach provides effective supervision for visual latent reasoning (Wang et al., 26 Nov 2025).

6. Licensing and Accessibility

Monet-SFT-125K, together with the Monet framework, trained models, and RL recipes, is available under the permissive Apache 2.0 license at https://github.com/NOVAglow646/Monet. No non-commercial or geographic restrictions apply beyond standard Apache 2.0 terms.

A plausible implication is that Monet-SFT-125K can be freely adopted for further research in latent visual reasoning, multimodal chain-of-thought supervision, and downstream RL applications within the constraints of the cited benchmarks and domains.
