Monet-SFT-125K: Multimodal CoT Dataset

Updated 28 November 2025
  • Monet-SFT-125K is a large-scale multimodal chain-of-thought dataset containing 125K rigorously curated image–text–CoT triples for supervising latent visual reasoning.
  • It employs a three-stage filtration and annotation pipeline to ensure necessity, correctness, and token-level alignment for robust model training.
  • The dataset underpins visual-latent policy optimization and supports diverse applications including chart reading, geometry, OCR, and 3D object counting.

Monet-SFT-125K is a large-scale, high-fidelity, multimodal chain-of-thought (CoT) dataset designed to provide supervision for supervised fine-tuning (SFT) and reinforcement learning pipelines targeting latent visual reasoning in multimodal LLMs (MLLMs). It consists of 125,000 rigorously curated image–text–CoT triples spanning abstract chart reading, real-world scene understanding, optical character recognition (OCR), geometry problems, and 3D object-counting tasks. Developed as part of the Monet framework, the dataset underpins explicit latent-embedding alignment and policy-optimization objectives for inductive reasoning well beyond conventional image–language modeling (Wang et al., 26 Nov 2025).

1. Composition and Problem Domains

Monet-SFT-125K aggregates data from six precursor sources, segmented into four key application domains: chart-only (ReFocus), geometry/sketch (Zebra-CoT geometry), 3D counting (Zebra-CoT count), and mixed real-world/OCR/chart reasoning (CogCoM, Visual-CoT, Zebra-CoT search).

The per-source breakdown is summarized as follows:

| Source | Problem Domain | Count (K) |
|---|---|---|
| ReFocus | Chart reading | 0.4 |
| CogCoM | Real-world / Charts | 0.5 |
| Visual-CoT | Real-world / Documents (OCR) / Charts | 118.6 |
| Zebra-CoT (search) | Real-world / Documents / Charts | 2.7 |
| Zebra-CoT (geometry) | Geometry problems | 0.1 |
| Zebra-CoT (count) | 3D object counting | 2.9 |
| Total | | 125 |

Chart-only and geometry tasks offer tightly constrained reasoning, while mixed domains introduce increased complexity and diversity. This grouping supports robust model transfer in both in-distribution and out-of-distribution visual reasoning.

2. Filtration and Annotation Pipeline

Monet-SFT-125K was constructed using a three-stage pipeline:

Stage 1: Necessity Filtering

All candidate CoT samples were sourced from precursor datasets. The base MLLM, Qwen2.5-VL-7B, was executed on each original question–image pair. Samples answerable without auxiliary images were filtered out, ensuring all retained tasks necessitate multi-step visual inference.

Stage 2: Correctness Filtering

For candidates passing Stage 1, the same model was prompted with only the chain-of-thought's auxiliary image segments (excluding the main question image). Examples were retained only if correct answers were produced, verifying each auxiliary visual's sufficiency and utility.
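
A minimal sketch of these two filters is given below; the helper `run_mllm`, the field names, and the exact-match answer check are assumptions for illustration, not part of the released pipeline.

```python
# Hypothetical sketch of Stage 1 (necessity) and Stage 2 (correctness) filtering.
# `run_mllm` stands in for querying the base MLLM (Qwen2.5-VL-7B); replace it with
# your own inference call (e.g., a transformers/vLLM generate wrapper).
def run_mllm(images: list, question: str) -> str:
    """Return the model's answer string given a list of images and a question."""
    raise NotImplementedError

def passes_necessity(sample: dict) -> bool:
    # Stage 1: drop samples the base model already answers from the question image
    # alone, so every retained task genuinely requires auxiliary visual reasoning.
    pred = run_mllm([sample["question_image"]], sample["question"])
    return pred.strip() != sample["answer"].strip()

def passes_correctness(sample: dict) -> bool:
    # Stage 2: keep a sample only if the CoT's auxiliary images (without the
    # question image) are sufficient for the model to produce the correct answer.
    pred = run_mllm(sample["auxiliary_images"], sample["question"])
    return pred.strip() == sample["answer"].strip()

candidates: list[dict] = []  # precursor CoT samples would be loaded here
filtered = [s for s in candidates if passes_necessity(s) and passes_correctness(s)]
```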

Stage 3: Token-Level Supervision

Retained samples underwent dual multimodal judgment (DeepSeek-V3.1, Gemini 2.5 Pro), with both annotators marking indispensable text tokens generated from direct visual observation. Marked tokens are wrapped with <observation>…</observation> tags, serving as hard semantic anchors for latent-alignment supervision.
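
For illustration, a toy example of the tag format and one way the anchored spans might be extracted (the CoT text below is invented, not drawn from the dataset):

```python
import re

# Invented CoT fragment in the <observation>-tagged format described above.
cot = (
    "Zooming into the cropped chart region, "
    "<observation>the 2019 bar peaks at 42 units</observation>, "
    "so the final answer is \\boxed{42}."
)

# Pull out the hard semantic anchors used for latent-alignment supervision.
observations = re.findall(r"<observation>(.*?)</observation>", cot, flags=re.DOTALL)
print(observations)  # ['the 2019 bar peaks at 42 units']
```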

Quality-control procedures enforced both necessity (Stage 1) and sufficiency (Stage 2) of each reasoning chain, eliminating trivial and noisy cases and giving rise to a dataset uniquely suited for latent observation alignment.

3. Data Structure and Representational Format

Each Monet-SFT-125K sample consists of a serialized interleaving of question text, image embeddings, auxiliary visual embeddings, and tagged text tokens, strictly in the format:

[Question Text]
[Question Image Embeddings]
<STEP-1 Auxiliary Visual Embedding>
<observation>…</observation> (text tokens)
…
\boxed{Answer}
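
A minimal sketch of how one such sample might be laid out before embedding, assuming illustrative field names (this is not the released schema):

```python
# Hypothetical in-memory record mirroring the serialization format above.
sample = {
    "question": "How many bars exceed 40 units?",      # [Question Text]
    "question_image": "chart_0001.png",                 # [Question Image Embeddings]
    "steps": [                                           # alternating visual/text steps
        {
            "auxiliary_image": "chart_0001_crop.png",    # <STEP-k Auxiliary Visual Embedding>
            "text": "<observation>three bars rise above the 40 gridline</observation>",
        },
        # ... further steps ...
    ],
    "answer": "\\boxed{3}",
}
```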

Images are uniformly resized to 224×224 pixels and patch-embedded via Qwen2.5-VL’s vision encoder (standard grid sizes: 14×14, 28×28; patch embedding dimension typically 768). Thus, each visual segment comprises a token sequence $V \in \mathbb{R}^{M \times d}$, where $M = P^2$ for a $P \times P$ patch grid and $d$ is the embedding dimension.

Latent embedding segments are introduced during SFT: whenever the <latent> token is generated by the decoder, the corresponding hidden state $h^{(t)} \in \mathbb{R}^d$ is recycled as a latent embedding, repeated for a fixed length $K$. At inference, these continuous embeddings stand in for actual visual clues.
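
A conceptual PyTorch sketch of this recycling step, assuming a decoder hidden size of 3584 and $K = 8$; it mirrors the description above rather than the Monet implementation:

```python
import torch

def recycle_latent(hidden_state: torch.Tensor, K: int = 8) -> torch.Tensor:
    """Repeat the decoder hidden state at the <latent> position K times,
    producing a (K, d) segment of continuous latent embeddings."""
    return hidden_state.unsqueeze(0).repeat(K, 1)

h_t = torch.randn(3584)        # hidden state at the <latent> token (d = 3584 assumed)
latent_segment = recycle_latent(h_t, K=8)
print(latent_segment.shape)    # torch.Size([8, 3584])
```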

Supervision leverages three loss terms:

  • Observation-alignment (Stage 2):

$$\mathcal{L}_{\mathrm{align\text{-}obs}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{\ell=1}^{L} \left( 1 - \cos\!\left(h^{*\,(i,\ell)}_{\mathrm{obs}},\, \hat h^{(i,\ell)}_{\mathrm{obs}}\right) \right)$$

Gradients propagate exclusively through generated latent embeddings.

  • Latent-alignment (Stage 3):

$$\mathcal{L}_{\mathrm{align\text{-}latent}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{\ell=1}^{L} \left( 1 - \cos\!\left(h^{*\,(i,\ell)}_{\mathrm{latent}},\, \hat h^{(i,\ell)}_{\mathrm{latent}}\right) \right)$$

  • Next-token prediction:

$$\mathcal{L}_{\mathrm{NTP}} = -\frac{1}{N}\sum_{i}\sum_{t} \log p\!\left( y_t \mid y_{<t},\ \text{past images and latents} \right)$$
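
As a rough PyTorch sketch (the batching and tensor shapes are assumptions, not the Monet code), both alignment terms reduce to the same mean-over-samples, sum-over-positions cosine form:

```python
import torch
import torch.nn.functional as F

def alignment_loss(h_target: torch.Tensor, h_pred: torch.Tensor) -> torch.Tensor:
    """Cosine alignment loss over hidden states of shape (N, L, d).

    Targets are detached so gradients flow only through the generated states,
    matching the note that gradients propagate exclusively through the latents."""
    cos = F.cosine_similarity(h_pred, h_target.detach(), dim=-1)   # (N, L)
    return (1.0 - cos).sum(dim=1).mean()                           # sum over L, mean over N

# The NTP term is ordinary cross-entropy over text tokens conditioned on past
# images and latent segments, e.g. F.cross_entropy(logits.transpose(1, 2), labels).
```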

4. Statistical Characteristics

All images are standardized to 224×224 resolution prior to embedding. Each sample averages approximately 10 reasoning steps (alternations of text and auxiliary images); the median is 8, with a variance of roughly 4. Given per-sample step counts $n_i$, these statistics follow

$$\mu = \frac{1}{N} \sum_i n_i, \qquad \sigma^2 = \frac{1}{N} \sum_i \left( n_i - \mu \right)^2$$

Latent reasoning length $K$ is evaluated at values {8, 10, 12, 16}, with single-stage SFT optimal at $K = 8$ and full RL+VLPO at $K = 10$.

Complexity ranges from single lookups and chart crops to multi-step geometric sketch construction and 3D object removal, supporting generalization to both abstract and real-world tasks.

5. Intended Applications and Empirical Benchmarks

Monet-SFT-125K is designed to enable high-fidelity supervision of latent embedding generation and alignment during SFT. It directly supports downstream Visual-latent Policy Optimization (VLPO), a reinforcement learning algorithm that performs policy-gradient updates through latent visual embeddings.

Performance of Monet-7B, trained using Monet-SFT-125K and VLPO, is reported on standard multimodal reasoning benchmarks:

  • V*: Fine-grained visual search
  • HRBench4K/HRBench8K: Chart and spatial reasoning
  • MME-RealWorld-Lite: Mixed perception + reasoning
  • VisualPuzzles: Out-of-distribution abstract logic

The accuracy results for key baselines are tabulated below:

| Model | V* Accuracy | VisualPuzzles Accuracy |
|---|---|---|
| Qwen2.5-VL-7B | 76.44% | 32.71% |
| + vanilla SFT | 81.68% | 33.99% |
| + SFT+GRPO | 78.53% | 30.99% |
| DeepEyes | 83.25% | 32.96% |
| Monet-7B (SFT+VLPO) | 83.25% | 35.02% |

The consistent gains for Monet-7B on both in-distribution perception and abstract out-of-distribution reasoning tasks suggest Monet-SFT-125K’s approach provides effective supervision for visual latent reasoning (Wang et al., 26 Nov 2025).

6. Licensing and Accessibility

Monet-SFT-125K, together with the Monet framework, trained models, and RL recipes, is available under the permissive Apache 2.0 license at https://github.com/NOVAglow646/Monet. No non-commercial or geographic restrictions apply beyond standard Apache 2.0 terms.

A plausible implication is that Monet-SFT-125K can be freely adopted for further research in latent visual reasoning, multimodal chain-of-thought supervision, and downstream RL applications within the constraints of the cited benchmarks and domains.
