Monet-SFT-125K: Visual Latent Reasoning Dataset

Updated 7 March 2026

Monet-SFT-125K is a curated dataset designed to enable robust visual latent reasoning by providing 125K high-quality chain-of-thought exemplars.
It features a multi-stage curation pipeline that filters for necessary, correct auxiliary images and applies detailed token-level observation annotations.
Integration into a three-stage supervised fine-tuning and reinforcement learning pipeline yields improved performance on both real-world and abstract visual reasoning benchmarks.

Monet-SFT-125K is a curated dataset designed to advance the capabilities of multimodal LLMs (MLLMs) in visual latent reasoning. It provides 125,000 high-quality chain-of-thought (CoT) exemplars interleaving image and text, with explicit supervision on the necessity, accuracy, and alignment of visual observations to drive robust training of visual-latent MLLMs. Monet-SFT-125K plays a central role in the Monet framework, supporting distillation-based supervised fine-tuning (SFT) pipelines and novel reinforcement learning techniques for tool-free, abstract visual reasoning tasks (Wang et al., 26 Nov 2025).

1. Motivation and Design Principles

Monet-SFT-125K was created to address limitations observed in prior image–text CoT datasets, which typically suffer from trivially solvable instances (where answers can be deduced without auxiliary images), noisy or inaccurate auxiliary images, and uniform treatment of all text tokens without considering the importance of those tied to visual observations. The primary goal is to construct a dataset in which auxiliary images are both indispensable and correct for reasoning, with fine-grained token-level annotations highlighting key visual observations. This supports model training regimes that require token–latent visual alignment and covers a broad range of domains, including real-world scenes, charts, documents, geometry, and 3D object counting.

2. Data Collection and Curation Pipeline

The curation process applies a structured, multi-stage pipeline to guarantee both necessity and correctness of visual steps, as well as precise observation-token annotation:

Stage 1: Raw-Data Filtering. Samples are drawn from ReFocus (Chart), CogCoM (Real-world, Chart), Visual-CoT (Real-world, Documents, Chart), and Zebra-CoT (Search, Geometry, 3D Count). Only samples that the base model (Qwen2.5-VL-7B) answers incorrectly when given just the question and the original image are kept, which ensures that the auxiliary images are required.
Stage 2: Correctness Validation. Samples are retained only if Qwen2.5-VL-7B (2-block version) answers correctly when provided with the auxiliary images alone, which filters for accurate auxiliary images.
Stage 3: Observation-Token Annotation. Two high-performing LLMs (DeepSeek-V3.1 and Gemini 2.5 Pro) identify minimal spans within the text that describe essential visual observations. These tokens are wrapped in the <observation>…</observation> tag and serve as anchors for loss alignment in SFT. The guidelines preclude adding new reasoning steps or redundant annotation: only existing, image-dependent tokens are tagged, and spans are kept minimal for specificity.
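
The observation tags from Stage 3 can be turned into character spans that a training loop could later map onto tokens. The sketch below is illustrative only: the function name `split_observations` and the choice of character offsets are assumptions, not part of the Monet release.

```python
import re

# Matches one <observation>...</observation> span (non-greedy).
OBS_TAG = re.compile(r"<observation>(.*?)</observation>", re.DOTALL)


def split_observations(annotated):
    """Strip observation tags and return (plain_text, spans).

    Each span is a (start, end) character offset into plain_text that
    covers one tagged observation. Hypothetical helper for illustration.
    """
    parts = []
    spans = []
    cursor = 0    # read position in the annotated string
    out_len = 0   # length of the plain text emitted so far
    for match in OBS_TAG.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        out_len += len(before)
        observation = match.group(1)
        spans.append((out_len, out_len + len(observation)))
        parts.append(observation)
        out_len += len(observation)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), spans


annotated = (
    "left with <observation>five objects</observation>: "
    "an <observation>orange sphere</observation>"
)
plain, spans = split_observations(annotated)
tagged = [plain[start:end] for start, end in spans]
```

Here `tagged` holds the two observation strings, and `spans` could be converted to a per-token mask once a tokenizer's offset mapping is available.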

3. Dataset Composition and Structure

Monet-SFT-125K consists of approximately 125,000 interleaved image–text CoT instances spanning five sub-domains and multiple data modalities:

Source	Domain(s)	Operation Types	Count
ReFocus	Chart	Bounding-box drawing, highlighting	400
CogCoM	Real-world, Chart	Cropping, auxiliary-line and bounding-box drawing	500
Visual-CoT	Real-world, Documents, Chart	Cropping, bounding-box drawing	118,600
Zebra-CoT (Search)	Real-world, Documents, Chart	Cropping, bounding-box drawing	2,700
Zebra-CoT (Geometry)	Geometry	Auxiliary-line drawing, geometric sketching	100
Zebra-CoT (3D Count)	3D Object Counting	Object state modification (removal/addition)	2,900
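
As a quick consistency check, the per-source counts in the table can be summed and converted to shares. This only restates the table's numbers; the variable names are illustrative.

```python
# Per-source counts copied from the table above.
counts = {
    "ReFocus": 400,
    "CogCoM": 500,
    "Visual-CoT": 118_600,
    "Zebra-CoT (Search)": 2_700,
    "Zebra-CoT (Geometry)": 100,
    "Zebra-CoT (3D Count)": 2_900,
}

total = sum(counts.values())

# Percentage share of each source in the full dataset.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:22s} {counts[name]:>8,d}  {share:5.1f}%")
print(f"{'Total':22s} {total:>8,d}")
```

The counts sum to 125,200, consistent with "approximately 125,000", and Visual-CoT alone contributes roughly 95% of the instances.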

Included modalities comprise photographs, infographic charts, document scans, and synthetic geometric sketches. Each instance follows a standardized format: a question and initial image, a sequence of paired text and auxiliary images (with observation-tagged tokens), and a final answer given as a LaTeX-style $\boxed{\cdots}$ expression.
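
One way to picture the standardized format is as a small record type. The sketch below is an assumption about how such an instance might be represented in code: the class names, field names, and image paths are hypothetical and do not reflect the released file schema.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningStep:
    text: str                        # may contain <observation> tags
    aux_image: Optional[str] = None  # hypothetical path or ID of the auxiliary image


@dataclass
class MonetInstance:
    question: str
    image: str                       # hypothetical path or ID of the initial image
    steps: List[ReasoningStep] = field(default_factory=list)
    answer: str = ""                 # final answer, e.g. "$\\boxed{...}$"


# Greedy match so that nested braces inside the box are kept.
BOXED = re.compile(r"\\boxed\{(.*)\}")


def extract_boxed(answer):
    """Return the content of \\boxed{...}, or None if there is no box."""
    match = BOXED.search(answer)
    return match.group(1) if match else None


example = MonetInstance(
    question="According to the infographic, what percentage of parents "
             "delay the Varicella vaccine?",
    image="chart_full.png",          # placeholder file name
    steps=[
        ReasoningStep(
            text="The cropped image shows the specific data point for the "
                 "<observation>Varicella vaccine</observation>",
            aux_image="chart_crop.png",  # placeholder file name
        )
    ],
    answer=r"$\boxed{44\%}$",
)
boxed = extract_boxed(example.answer)
```

The example reuses the cropping instance shown in this section; `boxed` holds the raw LaTeX content of the box, which a grader would still need to normalize.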

Example Structures

Cropping Example
- Question: “According to the infographic, what percentage of parents delay the Varicella vaccine?”
- [Full chart image]
- Auxiliary (Crop): [“Varicella” row]
- Reasoning: The cropped image shows the specific data point for the <observation>Varicella vaccine</observation>…
- Answer: $\boxed{44\%}$

3D Counting Example
- Question: “What is the count of orange objects after successive removals?”
- [Initial shapes]
- Auxiliary (Sketch): after removing pyramids
- Reasoning: ... left with <observation>five objects</observation>: an <observation>orange sphere</observation>, …
- Answer: $\boxed{1}$

4. Quality Control and Dataset Filtering

The dataset’s formal structure and admission criteria are rigorously defined. Let $S_0$ denote the raw interleaved-CoT sample set:

Necessity Filter:

$S_1 = \{s \in S_0\,|\,\text{Model}(s.\text{question} + s.\text{orig}_\text{image}) \neq \text{correct}\}$

Correctness Filter:

$S_2 = \{s \in S_1\,|\,\text{Model}(s.\text{aux}_\text{images}) = \text{correct}\}$

The final dataset $S_\text{final}$ is $S_2$ after token-level annotation.
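
The two filters compose as successive list filters. In the minimal sketch below, the judge callables stand in for model calls: `solves_from_original` and `solves_from_aux` are hypothetical names, and the toy samples carry precomputed flags instead of real model outputs.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]
Judge = Callable[[Sample], bool]


def necessity_filter(samples: List[Sample], solves_from_original: Judge) -> List[Sample]:
    """S1: keep samples the model gets WRONG from question + original image."""
    return [s for s in samples if not solves_from_original(s)]


def correctness_filter(samples: List[Sample], solves_from_aux: Judge) -> List[Sample]:
    """S2: keep samples the model gets RIGHT from the auxiliary images."""
    return [s for s in samples if solves_from_aux(s)]


# Toy data: flags replace actual model judgments.
raw = [
    {"id": "a", "orig_ok": True, "aux_ok": True},    # trivially solvable, dropped
    {"id": "b", "orig_ok": False, "aux_ok": True},   # kept
    {"id": "c", "orig_ok": False, "aux_ok": False},  # inaccurate aux images, dropped
]

s1 = necessity_filter(raw, lambda s: bool(s["orig_ok"]))
s2 = correctness_filter(s1, lambda s: bool(s["aux_ok"]))
kept_ids = [s["id"] for s in s2]
```

Only sample `b` survives both filters: the original image alone is not enough, and the auxiliary images lead to the correct answer.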

No diversity metrics (such as $\ell_p$ norms or entropy measures) are applied or reported. Annotation quality relies on LLM-based judging and human-crafted curation guidelines.

5. Integration into Monet’s Supervised Fine-Tuning Pipeline

Monet-SFT-125K is fundamentally integrated into a three-stage supervised fine-tuning (SFT) process underpinning the Monet framework:

Stage 1 (Warm-Up):

Standard next-token prediction on the dataset for 4 epochs (learning rate $1 \times 10^{-5}$; batch size 1 with 16 accumulation steps; decay 0.01). This stage trains the model to attend to interleaved images.
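
The warm-up hyperparameters can be collected in a small configuration object to make the effective batch size explicit. This sketch reads "decay" as weight decay and "1 × 16" as a per-device batch of 1 with 16 gradient-accumulation steps; both readings, the single-device setting, and all names are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WarmupConfig:
    epochs: int = 4
    learning_rate: float = 1e-5
    per_device_batch: int = 1
    grad_accum_steps: int = 16
    weight_decay: float = 0.01   # assumed meaning of "decay=0.01"

    @property
    def effective_batch(self) -> int:
        # Single device assumed; multiply by the device count otherwise.
        return self.per_device_batch * self.grad_accum_steps


def optimizer_steps(num_samples: int, cfg: WarmupConfig) -> int:
    """Optimizer updates over all epochs for a dataset of num_samples."""
    steps_per_epoch = math.ceil(num_samples / cfg.effective_batch)
    return steps_per_epoch * cfg.epochs


cfg = WarmupConfig()
total_steps = optimizer_steps(125_200, cfg)  # 125,200 = sum of the table counts
```

Under these assumptions each update sees 16 samples, so the number of updates scales inversely with the number of devices actually used.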

Stage 2 (Latent–Observation Alignment):

Teacher model receives ground-truth auxiliary images; the student replaces these with generated fixed-length latent embeddings. Auxiliary-image embeddings attend only to latents, not future text (controlled attention mask). The loss is:

[Equation not recoverable from the source: the Stage 2 loss and the definition of its alignment term.]

Gradients from the alignment term back-propagate only through the latents (latent-only BP).

Stage 3 (Latent Generation without Images):

The model is retrained without any ground-truth images. Generated latent representations are aligned to targets from Stage 2:

[Equation not recoverable from the source: the Stage 3 loss aligning generated latents to the Stage 2 targets.]

This stage enables the model to produce semantically meaningful, image-equivalent latents directly from text, without external tools or images.
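
The exact Stage 3 objective is not reproduced in this text. As a stand-in, the sketch below shows one common way to align generated latents with fixed targets, a mean cosine distance; the choice of cosine distance and the function names are assumptions, not the formulation from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def latent_alignment_loss(generated, targets) -> float:
    """Mean (1 - cosine similarity) over paired latent vectors.

    Assumed stand-in loss: 0 when every generated latent points in the
    same direction as its target, larger as they diverge.
    """
    if len(generated) != len(targets):
        raise ValueError("generated and targets must have the same length")
    distances = [1.0 - cosine_similarity(g, t) for g, t in zip(generated, targets)]
    return sum(distances) / len(distances)


# Toy latents: the first pair is parallel, the second is orthogonal.
generated = [[1.0, 0.0], [0.0, 2.0]]
targets = [[2.0, 0.0], [1.0, 0.0]]
loss = latent_alignment_loss(generated, targets)
```

In an actual training setup the targets would be treated as constants (no gradient), so only the generated latents are pushed toward them.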

6. Empirical Impact and Ablation Studies

Vanilla SFT on Monet-SFT-125K yields a 5–7 percentage point improvement over Qwen2.5-VL-7B on challenging real-world and abstract visual reasoning benchmarks (V*, HRBench4K/8K, MME-RealWorld). Full three-stage Monet-SFT further adds 1–3 points over vanilla SFT. Monet’s VLPO (Visual-latent Policy Optimization) reinforcement stage provides an additional 1–2 point gain and demonstrates out-of-distribution generalization, e.g., 35.02% accuracy on the OOD VisualPuzzles benchmark (vs. 33.99% for vanilla SFT, 32.71% for base model) (Wang et al., 26 Nov 2025).

Critical ablations in SFT Stage 2 establish the necessity of all major design elements:

- Excluding latent-only backpropagation collapses accuracy (82.2% $\rightarrow$ 46.1% on V*).
- Disabling auxiliary-image to latent attention flow leads to a 9-point drop (82.2% $\rightarrow$ 73.3%).
- Omitting observation-token alignment reduces accuracy to 75.4%.
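
The size of each ablation effect follows from simple subtraction. The sketch below assumes all three results are on V* and share the 82.2% full-model baseline; the source states this explicitly only for the first ablation.

```python
# Full Stage 2 accuracy and ablated accuracies (percent), from the list above.
full_accuracy = 82.2
ablations = {
    "no latent-only backpropagation": 46.1,
    "no auxiliary-image to latent attention": 73.3,
    "no observation-token alignment": 75.4,
}

# Drop in percentage points relative to the full configuration.
drops = {name: round(full_accuracy - acc, 1) for name, acc in ablations.items()}

for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"{name:42s} -{drop:.1f} points")
```

Under that assumption, removing latent-only backpropagation is by far the most damaging change, while the other two ablations cost between roughly 7 and 9 points.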

These results underscore the importance of both fine-grained observation annotation and latent-alignment supervision in learning functional visual representations.

7. Limitations and Prospective Directions

Monet-SFT-125K’s scalability is constrained by reliance on human-engineered curation pipelines and external LLM judges, rendering large-scale annotation resource-intensive. The dataset’s coverage remains limited to the five original sources, with notable domain gaps (e.g., medical or architectural imagery). There is no explicit metric reported for linguistic or visual diversity; future work may incorporate embedding-space diversity assessments. Current annotations only mark positive (supporting) examples, and the addition of negative or distractor images is suggested as a means to improve robustness. The reinforcement learning stage’s reward design is, at present, limited to answer accuracy and formatting; further investigations could extend to penalizing spurious latents or optimizing for more nuanced latent representation properties.

Overall, Monet-SFT-125K enables MLLMs to perform chain-of-thought reasoning with semantically rich, tool-independent latent visual embeddings, and provides fine-grained, image–text interleaved supervision for MLLM training pipelines (Wang et al., 26 Nov 2025).

References

Wang et al. (2025). Monet: Reasoning in Latent Visual Space Beyond Images and Language.
