Honey-Data-15M Multimodal SFT Dataset
- Honey-Data-15M is a large-scale multimodal supervised fine-tuning dataset comprising 15M QA pairs enriched with dual-level chain-of-thought annotations.
- It utilizes a modular curation pipeline with rule-based and model-based filtering to eliminate noise and ensure effective vision–language alignment.
- The dataset supports state-of-the-art training recipes and transparent evaluation, advancing open research in complex reasoning and instruction following for MLLMs.
Honey-Data-15M is a large-scale, high-quality multimodal supervised fine-tuning (SFT) dataset constructed to advance the capabilities of fully open multimodal LLMs (MLLMs), specifically in domains requiring complex reasoning, effective vision–language alignment, and robust instruction following. The corpus consists of approximately 15 million meticulously curated QA pairs enriched with dual-level chain-of-thought (CoT) annotations, and is supported by a modular data curation pipeline (HoneyPipe) and a full-stack, open-source suite for reproducible SFT research and deployment (Zhang et al., 15 Oct 2025).
1. Dataset Construction and Cleaning
The Honey-Data-15M corpus is explicitly designed to remedy known deficits in public SFT datasets—namely, the prevalence of low-quality, noisy samples, instruction–image mismatches, and a relative scarcity of extended reasoning data. Construction proceeds through a multi-step, model-driven and rule-based pipeline (HoneyPipe, built on DataStudio) incorporating:
- Rule-based Filtering: Discards samples with extremely low resolution, improper image aspect ratios, missing instructions or responses, or repeated and degenerate LaTeX code fragments.
- Model-based Filtering: Applies an advanced vision-LLM to verify the semantic relevance between instructional text and visual input, removing mismatched samples.
- Manual Source Scoring: Assigns quality ratings to data sources, further refining the inclusion criteria for the supervised corpus.
This process yields a dataset with enhanced instructional relevance, minimal text-image mismatch, and significantly mitigated textual and visual noise.
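As an illustration of the rule-based stage, the following is a minimal sketch assuming a simple record schema with `image_path`, `instruction`, and `response` fields; the field names, thresholds, and heuristics are hypothetical and do not reproduce the released HoneyPipe configuration.

```python
from PIL import Image

# Illustrative thresholds only; the actual HoneyPipe cutoffs are not specified here.
MIN_SIDE_PX = 32         # reject extremely low-resolution images
MAX_ASPECT_RATIO = 10.0  # reject improperly elongated images

def passes_rule_filters(sample: dict) -> bool:
    """Rule-based checks analogous to those listed above (hypothetical schema)."""
    instruction = (sample.get("instruction") or "").strip()
    response = (sample.get("response") or "").strip()
    if not instruction or not response:
        return False  # missing instruction or response

    with Image.open(sample["image_path"]) as img:
        width, height = img.size
    if min(width, height) < MIN_SIDE_PX:
        return False  # extremely low resolution
    if max(width, height) / max(min(width, height), 1) > MAX_ASPECT_RATIO:
        return False  # improper aspect ratio

    # Crude check for repeated or degenerate LaTeX fragments in the response.
    tokens = response.split()
    for i in range(len(tokens) - 5):
        window = tokens[i:i + 6]
        if len(set(window)) == 1 and window[0].startswith("\\"):
            return False
    return True
```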
2. Dual-Level Chain-of-Thought (CoT) Enrichment
Recognizing the critical role of chain-of-thought reasoning in complex task performance, Honey-Data-15M employs a dual-level CoT response enrichment strategy:
- Short CoT (N-T mode): Concise, stepwise logical steps aimed at covering a broad array of simple reasoning problems.
- Long CoT (T mode): Extended, in-depth explanations comprising multi-step deductive chains and, where appropriate, LaTeX-formatted mathematical expressions, e.g., explicit formulas embedded within visual reasoning or geometry tasks.
This bi-modal CoT augmentation ensures both wide coverage of reasoning types and the presence of deep, multi-stage solution demonstrations, challenging models to learn both fact retrieval and extended inference.
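For concreteness, the sketch below shows what a dual-level CoT sample could look like; the field names, mode labels, and content are purely illustrative and not the released Honey-Data-15M schema.

```python
# Hypothetical dual-level CoT record (illustrative schema, not the released one).
sample = {
    "image": "geometry_0001.png",
    "instruction": "What is the area of the shaded triangle?",
    # Short CoT (N-T mode): concise, stepwise reasoning.
    "short_cot_response": (
        "The base is 6 and the height is 4, so the area is 1/2 * 6 * 4 = 12."
    ),
    # Long CoT (T mode): extended deduction with LaTeX-formatted expressions.
    "long_cot_response": (
        "Step 1: Read the base from the figure: $b = 6$.\n"
        "Step 2: Read the height: $h = 4$.\n"
        "Step 3: Apply the triangle area formula "
        "$A = \\tfrac{1}{2} b h = \\tfrac{1}{2} \\cdot 6 \\cdot 4 = 12$.\n"
        "Answer: 12."
    ),
}
```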
3. Modular Data Curation Pipeline and Infrastructure
The data curation infrastructure is centered around HoneyPipe and the underlying DataStudio framework, supporting:
- Deduplication, Quality Filtering, and Enrichment: Automated, modular stages for eliminating repetitions and augmenting responses with targeted CoT instruction.
- Iterative Refinement: Curation parameters, model filters, and enrichment strategies can be dynamically configured and updated, supporting continual improvements and reproducibility.
- Flexible Output: Capable of exporting data in various formats ready for SFT on different architectures, making it extensible for new MLLM research directions.
This approach supersedes static releases with a maintainable, fully transparent pipeline adaptable to evolving community standards and downstream model requirements.
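A minimal sketch of how such composable stages might be chained is shown below; the stage names and generator-based interface are assumptions for illustration, not the actual DataStudio API.

```python
from typing import Callable, Iterable, Iterator

# A stage maps a stream of samples to a (possibly smaller) stream of samples.
Stage = Callable[[Iterable[dict]], Iterator[dict]]

def deduplicate(samples: Iterable[dict]) -> Iterator[dict]:
    """Drop exact duplicates keyed on (instruction, response)."""
    seen = set()
    for s in samples:
        key = (s["instruction"], s["response"])
        if key not in seen:
            seen.add(key)
            yield s

def quality_filter(samples: Iterable[dict]) -> Iterator[dict]:
    """Keep samples with non-empty fields; the rule- and model-based
    checks from Section 1 would slot in here."""
    for s in samples:
        if s["instruction"].strip() and s["response"].strip():
            yield s

def cot_enrich(samples: Iterable[dict]) -> Iterator[dict]:
    """Placeholder for model-based CoT enrichment of the response."""
    for s in samples:
        s["long_cot_response"] = s["response"]  # stand-in; a VLM would rewrite this
        yield s

def run_pipeline(samples: Iterable[dict], stages: list[Stage]) -> list[dict]:
    """Apply each curation stage in order."""
    stream: Iterable[dict] = samples
    for stage in stages:
        stream = stage(stream)
    return list(stream)

# Usage (with records loaded from the source datasets):
# curated = run_pipeline(raw_samples, [deduplicate, quality_filter, cot_enrich])
```

Chaining stages as independent generators keeps each step separately configurable and swappable, which mirrors the iterative-refinement property described above.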
4. Model Training Recipes and Empirical Results
The Bee-8B model, an 8B-parameter fully open MLLM, was trained on Honey-Data-15M using a five-stage recipe comprising:
- MLP Warmup: Aligns unimodal representations and bridges modality gaps between text and image channels.
- Full Vision–Language Alignment: Integrates visual and textual streams before SFT.
- Main Multimodal SFT: Trains on the entire Honey-Data-15M corpus with dual-level CoT responses.
- 1M-Subset Efficient Refinement: Refines on a high-quality 1M subset for increased stability.
- Reinforcement Learning with GRPO: Applies Group Relative Policy Optimization to penalize repetitive or format-flawed outputs and further regularize model predictions (a minimal sketch of such a rule-based reward appears at the end of this section).
Empirical benchmarks demonstrate that Bee-8B obtains state-of-the-art (SOTA) performance among fully open MLLMs and achieves competitive, sometimes superior, results relative to semi-open systems such as InternVL3.5-8B. Notably, ablation studies confirm that both dual-level CoT enrichment and systematic multi-stage SFT are directly responsible for marked improvements in both short- and long-form inference tasks.
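The RL stage's repetition and format penalties are described only at a high level; the sketch below shows one way a rule-based reward and GRPO's group-relative advantage could be computed. The specific heuristics, weights, and thresholds are assumptions for illustration, not the paper's actual reward design.

```python
import re

def repetition_and_format_reward(output: str) -> float:
    """Hypothetical rule-based reward penalizing repetitive or format-flawed outputs."""
    reward = 1.0

    # Penalize heavy n-gram repetition (here, repeated 4-grams).
    tokens = output.split()
    ngrams = [tuple(tokens[i:i + 4]) for i in range(max(len(tokens) - 3, 0))]
    if ngrams:
        repetition_rate = 1.0 - len(set(ngrams)) / len(ngrams)
        reward -= repetition_rate

    # Penalize format flaws, e.g. unbalanced inline-math delimiters
    # or a missing final answer marker (illustrative conventions).
    if output.count("$") % 2 != 0:
        reward -= 0.5
    if not re.search(r"(?i)answer\s*:", output):
        reward -= 0.25
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]
```

In GRPO, these group-normalized advantages replace a learned value baseline when weighting the policy-gradient updates, which is what makes simple rule-based rewards like the one sketched above practical.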
5. Open Community Resources
The Honey-Data-15M project includes a complete open-source release:
- Corpus: All 15M samples, including dual-level CoT annotations and full provenance.
- Curation suite: HoneyPipe and DataStudio codebases enabling further data iteration and transparent workflows.
- Training recipes: Detailed configurations covering all pretraining and SFT processes.
- Evaluation harness: A customized version of VLMEvalKit for rigorous, benchmark-spanning evaluation.
- Model weights: Bee-8B parameters available for direct experimentation and downstream deployment.
This end-to-end stack is intended to reduce the dependence of the research community on proprietary, partially open datasets by supplying reproducible, high-quality training and evaluation resources.
6. Significance and Implications
Honey-Data-15M represents a data-centric paradigm shift in open MLLM development. By prioritizing data quality—via rigorous cleaning, CoT-driven reasoning annotation, and modular curation—it enables fully transparent models to close the performance gap with commercial systems. The dual-level CoT strategy directly supports advances in complex, multi-hop reasoning, and the open infrastructure facilitates reproducibility and broad community engagement.
A plausible implication is that as the dual-level CoT enrichment and modular pipeline methodology of Honey-Data-15M permeate the open-source ecosystem, open MLLMs will continue to reduce, or eventually eliminate, the performance deficit with their commercial counterparts (Zhang et al., 15 Oct 2025).
Table: Core Components of the Honey-Data-15M Suite
| Component | Description | Function |
|---|---|---|
| Honey-Data-15M | 15M QA pairs with dual-level CoT enrichment | SFT corpus for MLLM reasoning and alignment |
| HoneyPipe | Modular data curation pipeline | Iterative filtering, annotation, enrichment |
| DataStudio | Underlying curation framework | Workflow management, reproducibility |
| Bee-8B Model | Open 8B-parameter MLLM trained on Honey-Data-15M | Benchmark state-of-the-art performance |
| VLMEvalKit (ext.) | Customized evaluation harness | Benchmarking diverse MLLM tasks |
This integrated stack and its empirical successes position Honey-Data-15M as a foundational resource for open research in vision–language modeling, reasoning, and instruction following at scale.