Monet-SFT-125K: Multimodal CoT Dataset
- Monet-SFT-125K is a large-scale multimodal chain-of-thought dataset containing 125K rigorously curated image–text–CoT triples for supervising latent visual reasoning.
- It employs a three-stage filtration and annotation pipeline to ensure necessity, correctness, and token-level alignment for robust model training.
- The dataset underpins visual-latent policy optimization and supports diverse applications including chart reading, geometry, OCR, and 3D object counting.
Monet-SFT-125K is a large-scale, high-fidelity multimodal chain-of-thought (CoT) dataset designed to provide supervision for latent visual reasoning in multimodal LLMs (MLLMs) during supervised fine-tuning (SFT) and subsequent reinforcement learning. It consists of 125,000 rigorously curated image–text–CoT triples spanning abstract chart reading, real-world scene understanding, optical character recognition (OCR), geometry problems, and 3D object-counting tasks. Developed as part of the Monet framework, the dataset underpins explicit latent-embedding alignment and policy-optimization objectives that extend well beyond conventional image–language modeling (Wang et al., 26 Nov 2025).
1. Composition and Problem Domains
Monet-SFT-125K aggregates data from six precursor sources, segmented into four key application domains: chart-only (ReFocus), geometry/sketch (Zebra-CoT geometry), 3D counting (Zebra-CoT count), and mixed real-world/OCR/chart reasoning (CogCoM, Visual-CoT, Zebra-CoT search).
The per-source breakdown is summarized as follows:
| Source | Problem Domain | Count (K) |
|---|---|---|
| ReFocus | Chart reading | 0.4 |
| CogCoM | Real-world / Charts | 0.5 |
| Visual-CoT | Real-world / Documents (OCR) / Charts | 118.6 |
| Zebra-CoT (search) | Real-world / Documents / Charts | 2.7 |
| Zebra-CoT (geometry) | Geometry problems | 0.1 |
| Zebra-CoT (count) | 3D object counting | 2.9 |
| Total | — | 125 |
Chart-only and geometry tasks offer tightly constrained reasoning, while mixed domains introduce increased complexity and diversity. This grouping supports robust model transfer in both in-distribution and out-of-distribution visual reasoning.
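For quick reference, the per-source breakdown can be captured as a small mapping; the snippet below is purely illustrative and simply reproduces the counts from the table above.

```python
# Per-source composition of Monet-SFT-125K (counts in thousands), transcribed
# from the table above. The dictionary structure is illustrative only.
SOURCES_K = {
    "ReFocus (chart reading)": 0.4,
    "CogCoM (real-world / charts)": 0.5,
    "Visual-CoT (real-world / OCR / charts)": 118.6,
    "Zebra-CoT search (real-world / documents / charts)": 2.7,
    "Zebra-CoT geometry (geometry problems)": 0.1,
    "Zebra-CoT count (3D object counting)": 2.9,
}

total_k = sum(SOURCES_K.values())
print(f"Total: {total_k:.1f}K samples")  # ~125.2K, reported as 125K
```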
2. Filtration and Annotation Pipeline
Monet-SFT-125K was constructed using a three-stage pipeline:
Stage 1: Necessity Filtering
All candidate CoT samples were sourced from precursor datasets. The base MLLM, Qwen2.5-VL-7B, was run on each original question–image pair, and samples the base model could answer correctly without any auxiliary images were filtered out, ensuring that every retained task requires multi-step visual inference.
Stage 2: Correctness Filtering
For candidates passing Stage 1, the same model was prompted with only the chain-of-thought's auxiliary image segments (excluding the main question image). Examples were retained only if correct answers were produced, verifying each auxiliary visual's sufficiency and utility.
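A minimal sketch of these two filtering stages appears below. The `answers_correctly` helper is a hypothetical wrapper around Qwen2.5-VL-7B inference and the field names are assumptions; only the necessity and sufficiency criteria mirror the pipeline described above.

```python
from typing import Callable, Sequence

# Hypothetical inference wrapper: returns True if the model, given the question
# and the supplied images, produces the ground-truth answer. Not a released API.
AnswerFn = Callable[[str, Sequence[object], str], bool]

def necessity_filter(sample: dict, answers_correctly: AnswerFn) -> bool:
    """Stage 1: keep only samples that cannot be solved from the question image alone."""
    solved_directly = answers_correctly(
        sample["question"], [sample["question_image"]], sample["answer"]
    )
    return not solved_directly  # auxiliary visual reasoning is genuinely required

def sufficiency_filter(sample: dict, answers_correctly: AnswerFn) -> bool:
    """Stage 2: keep only samples whose auxiliary CoT images suffice to recover the answer."""
    return answers_correctly(
        sample["question"], sample["auxiliary_images"], sample["answer"]
    )

def filter_corpus(samples: list, answers_correctly: AnswerFn) -> list:
    """Apply both filters; only samples passing Stage 1 are checked against Stage 2."""
    return [
        s for s in samples
        if necessity_filter(s, answers_correctly) and sufficiency_filter(s, answers_correctly)
    ]
```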
Stage 3: Token-Level Supervision
Retained samples underwent dual multimodal judgment (DeepSeek-V3.1, Gemini 2.5 Pro), with both annotators marking indispensable text tokens generated from direct visual observation. Marked tokens are wrapped with <observation>…</observation> tags, serving as hard semantic anchors for latent-alignment supervision.
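To illustrate how the tags translate into token-level supervision, the sketch below strips <observation>…</observation> markers and builds a per-token anchor mask; the `tokenize` callable and the return format are assumptions, not the released tooling.

```python
import re

OBS_PATTERN = re.compile(r"<observation>(.*?)</observation>", re.DOTALL)

def observation_tokens_and_mask(cot_text: str, tokenize) -> tuple[list, list[int]]:
    """Split a tagged CoT string into tokens plus a 0/1 mask marking observation anchors.

    `tokenize` is a hypothetical callable mapping a string to a list of tokens;
    the tags themselves are stripped so the mask aligns with the clean text.
    """
    tokens, mask, cursor = [], [], 0
    for match in OBS_PATTERN.finditer(cot_text):
        before = tokenize(cot_text[cursor:match.start()])  # ordinary reasoning tokens
        tokens += before
        mask += [0] * len(before)
        inside = tokenize(match.group(1))                  # visually grounded anchor tokens
        tokens += inside
        mask += [1] * len(inside)
        cursor = match.end()
    tail = tokenize(cot_text[cursor:])
    tokens += tail
    mask += [0] * len(tail)
    return tokens, mask

# Example: observation_tokens_and_mask("Note: <observation>the bar peaks at 7</observation>.", str.split)
```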
Quality-control procedures enforced both necessity (Stage 1) and sufficiency (Stage 2) of each reasoning chain, eliminating trivial and noisy cases and giving rise to a dataset uniquely suited for latent observation alignment.
3. Data Structure and Representational Format
Each Monet-SFT-125K sample consists of a serialized interleaving of question text, image embeddings, auxiliary visual embeddings, and tagged text tokens, strictly in the format:
```
[Question Text]
[Question Image Embeddings]
<STEP-1 Auxiliary Visual Embedding>
<observation>…</observation> (text tokens)
…
\boxed{Answer}
```
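The interleaved structure can be captured in a simple schema; the dataclasses below are an illustrative representation with assumed field names, not the released data format.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    aux_visual_embedding: list   # patch embeddings of the auxiliary image
    observation_text: str        # text marked as direct visual observation

@dataclass
class MonetSample:
    question: str
    question_image_embedding: list
    steps: list = field(default_factory=list)
    answer: str = ""

    def serialize(self) -> str:
        """Render the textual skeleton of the interleaved format (embeddings shown as placeholders)."""
        parts = [self.question, "[Question Image Embeddings]"]
        for i, step in enumerate(self.steps, start=1):
            parts.append(f"<STEP-{i} Auxiliary Visual Embedding>")
            parts.append(f"<observation>{step.observation_text}</observation>")
        parts.append(f"\\boxed{{{self.answer}}}")
        return "\n".join(parts)
```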
Images are uniformly resized to 224×224 pixels and patch-embedded via Qwen2.5-VL’s vision encoder (standard grid sizes: 14×14, 28×28; patch embedding dimension typically 768). Each visual segment thus comprises a token sequence of patch embeddings whose length is determined by the chosen patch grid.
Latent embedding segments are introduced during SFT: whenever the decoder emits the <latent> token, the corresponding hidden state is recycled as a latent embedding, repeated up to a fixed latent reasoning length. At inference time, these continuous embeddings stand in for actual visual clues.
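A minimal sketch of this recycling loop is shown below, assuming an autoregressive decoder that exposes its last hidden state; `decoder_step`, `embed_token`, the projection module, and the token id are illustrative names rather than the Monet implementation.

```python
import torch
import torch.nn as nn

LATENT_TOKEN_ID = 32_000    # placeholder id for the special <latent> token (illustrative)
LATENT_LENGTH = 10          # fixed latent reasoning length; values {8, 10, 12, 16} are swept below

def generate_with_latents(decoder_step, embed_token, project: nn.Module,
                          input_embeds: torch.Tensor, max_steps: int = 256) -> torch.Tensor:
    """Sketch of the latent-recycling decode loop.

    Assumptions (not the Monet implementation): `decoder_step(embeds)` returns
    (next_token_id, last_hidden_state); `embed_token(id)` embeds a discrete token;
    `project` maps a hidden state into the input-embedding space.
    """
    embeds = input_embeds  # shape: (1, seq_len, hidden_dim)
    for _ in range(max_steps):
        next_id, hidden = decoder_step(embeds)
        if next_id == LATENT_TOKEN_ID:
            # Recycle the hidden state as continuous "visual" input for LATENT_LENGTH steps.
            for _ in range(LATENT_LENGTH):
                latent = project(hidden).view(1, 1, -1)
                embeds = torch.cat([embeds, latent], dim=1)
                _, hidden = decoder_step(embeds)
        else:
            # Ordinary text token: append its embedding and continue decoding.
            embeds = torch.cat([embeds, embed_token(next_id).view(1, 1, -1)], dim=1)
    return embeds
```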
Supervision leverages three loss terms:
- Observation-alignment (Stage 2): gradients propagate exclusively through the generated latent embeddings.
- Latent-alignment (Stage 3): uses the tagged <observation> tokens as hard semantic anchors for the latent embeddings.
- Next-token prediction (NTP): standard cross-entropy over the textual reasoning tokens and the final answer.
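The exact formulations appear in the Monet paper; the sketch below only illustrates how such terms are typically instantiated and combined, with the cosine-style alignment distance and the loss weights being assumptions.

```python
import torch
import torch.nn.functional as F

def alignment_loss(latent_states: torch.Tensor, target_states: torch.Tensor) -> torch.Tensor:
    """Generic alignment term between latent embeddings and a supervision target.

    The target is detached so that gradients propagate exclusively through the
    generated latent embeddings, as described for the observation-alignment loss.
    The cosine-style distance is an assumption; the paper's exact form may differ.
    """
    return (1.0 - F.cosine_similarity(latent_states, target_states.detach(), dim=-1)).mean()

def ntp_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Standard next-token-prediction cross-entropy over the text tokens."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=ignore_index)

def total_sft_loss(loss_obs: torch.Tensor, loss_latent: torch.Tensor, loss_text: torch.Tensor,
                   w_obs: float = 1.0, w_latent: float = 1.0) -> torch.Tensor:
    """Weighted sum of the three terms; the weights are illustrative, not reported values."""
    return w_obs * loss_obs + w_latent * loss_latent + loss_text
```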
4. Statistical Characteristics
All images are standardized to 224×224 resolution prior to embedding. Each sample averages approximately 10 reasoning steps (alternations of text and auxiliary images); the median is 8, with a variance of roughly 4. Given per-sample step counts $s_1, \dots, s_N$, these statistics follow the standard definitions $\bar{s} = \frac{1}{N}\sum_{i=1}^{N} s_i$ and $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (s_i - \bar{s})^2$.
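For concreteness, these statistics can be computed directly from the per-sample step counts; the snippet below uses placeholder counts, not the real distribution.

```python
import statistics

# Placeholder per-sample reasoning-step counts; the real distribution has
# mean ≈ 10, median = 8, and variance ≈ 4 as reported above.
step_counts = [8, 8, 7, 12, 9, 10, 14, 8, 11, 13]

print("mean:", statistics.mean(step_counts))
print("median:", statistics.median(step_counts))
print("variance:", statistics.pvariance(step_counts))
```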
The latent reasoning length is evaluated at values {8, 10, 12, 16}; the optimal setting differs between single-stage SFT and the full RL+VLPO pipeline.
Complexity ranges from single-lookups and chart crops to multi-step geometric sketch construction and 3D object removal, supporting generalization to abstract and real-world tasks.
5. Intended Applications and Empirical Benchmarks
Monet-SFT-125K is designed to enable high-fidelity supervision of latent embedding generation and alignment during SFT. It directly supports downstream Visual-latent Policy Optimization (VLPO), a reinforcement learning algorithm that performs policy-gradient updates over latent visual embeddings.
Performance of Monet-7B, trained using Monet-SFT-125K and VLPO, is reported on standard multimodal reasoning benchmarks:
- V*: Fine-grained visual search
- HRBench4K/HRBench8K: High-resolution fine-grained perception
- MME-RealWorld-Lite: Mixed perception + reasoning
- VisualPuzzles: Out-of-distribution abstract logic
The accuracy results for key baselines are tabulated below:
| Model | V* Accuracy | VisualPuzzles Accuracy |
|---|---|---|
| Qwen2.5-VL-7B | 76.44% | 32.71% |
| + vanilla SFT | 81.68% | 33.99% |
| + SFT+GRPO | 78.53% | 30.99% |
| DeepEyes | 83.25% | 32.96% |
| Monet-7B (SFT+VLPO) | 83.25% | 35.02% |
The gains of Monet-7B over the Qwen2.5-VL-7B baseline and its vanilla SFT and SFT+GRPO variants, on both in-distribution perception and abstract out-of-distribution reasoning tasks, suggest that Monet-SFT-125K provides effective supervision for visual latent reasoning (Wang et al., 26 Nov 2025).
6. Licensing and Accessibility
Monet-SFT-125K, together with the Monet framework, trained models, and RL recipes, is available under the permissive Apache 2.0 license at https://github.com/NOVAglow646/Monet. No non-commercial or geographic restrictions apply beyond standard Apache 2.0 terms.
A plausible implication is that Monet-SFT-125K can be freely adopted for further research in latent visual reasoning, multimodal chain-of-thought supervision, and downstream RL applications within the constraints of the cited benchmarks and domains.