Honey-Data-15M: Multimodal QA Dataset
- Honey-Data-15M is a large-scale multimodal QA dataset with 15M image–instruction–response triplets supporting both simple recognition and complex reasoning tasks.
- It employs dual-level chain-of-thought enrichment by adding ~12.2M short CoT and ~2.7M long CoT samples to guide stepwise and multi-hop logical reasoning.
- Utilizing HoneyPipe and DataStudio, the dataset benefits from rigorous multi-stage cleaning and transparent curation, closing performance gaps in fully open MLLMs.
Honey-Data-15M is a supervised fine-tuning corpus of approximately 15 million question–answer (QA) pairs structured as image–instruction–response triplets, designed to address critical shortcomings in fully open multimodal LLM (MLLM) training data. Its construction incorporates multi-stage data cleaning and dual-level chain-of-thought (CoT) enrichment, and the corpus is produced and maintained via the HoneyPipe curation pipeline built on DataStudio. The dataset supports both simple recognition and complex reasoning tasks and has demonstrated the ability, via benchmark training of the Bee-8B model, to close the performance gap between fully open and semi-open MLLMs (Zhang et al., 15 Oct 2025).
1. Dataset Structure and Domain Coverage
Honey-Data-15M comprises approximately 15 million multimodal QA samples. Each sample is a triplet: a visual input (image), an instruction or question, and a response, explicitly supporting image-level and language-level reasoning tasks. The corpus incorporates domain diversity, systematically drawing from general, chart, document, STEM, and other domains. This design provides broad coverage of both narrow-context (e.g., visual recognition) and high-context (e.g., scene understanding, manuscript analysis) tasks.
A distinguishing characteristic is the dual-level CoT enrichment. Approximately 12.2 million samples are augmented with “short CoT” explanations that guide a stepwise logical progression (“think” segments), while 2.7 million are supplemented with “long CoT” entries involving multi-step, in-depth reasoning. The inclusion of both levels is intended to facilitate model learning across the full spectrum of reasoning complexity, from direct retrieval to extended inference.
| Attribute | Value/Description | Significance |
|---|---|---|
| QA total | ~15 million samples | Large-scale SFT for MLLMs |
| CoT short | ~12.2 million samples | Structured stepwise reasoning |
| CoT long | ~2.7 million samples | Extended multi-step logic |
| Domains | General, Chart, Document, STEM, etc. | Supports multimodal, cross-domain SFT |
| Triplet type | image–instruction–response | Alignment for vision+language MLLMs |
This composition enables comprehensive training, fostering multi-stage reasoning and generalization to previously underrepresented instruction types.
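The triplet structure described above can be sketched as a small schema. Field names here are illustrative assumptions for exposition, not the official release format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HoneySample:
    """One Honey-Data-15M-style triplet (hypothetical field names)."""
    image_path: str                 # visual input
    instruction: str                # question or task prompt
    response: str                   # final answer text
    cot: Optional[str] = None       # optional chain-of-thought segment
    cot_level: Optional[str] = None # "short" or "long" when CoT is present
    domain: str = "general"         # e.g. "chart", "document", "stem"

sample = HoneySample(
    image_path="chart_0001.png",
    instruction="Which quarter shows the highest revenue?",
    response="Q3",
    cot="The bars rise from Q1 to Q3 and fall in Q4, so Q3 is highest.",
    cot_level="short",
    domain="chart",
)
```

The optional `cot`/`cot_level` fields mirror the dataset's split between plain QA pairs and CoT-enriched ones.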
2. Data Collection and Cleaning Methodology
Source data are aggregated from a curated pool of publicly available multimodal datasets and community-contributed repositories. The cleaning pipeline, as implemented in HoneyPipe on DataStudio, consists of rule-based and model-based filtering procedures:
- Rule-based operators filter samples by static criteria (e.g., discarding low-resolution images, removing samples with abnormal aspect ratios, or repetitive text patterns).
- Model-based filters utilize established MLLMs (e.g., Qwen2.5-VL-72B) to detect semantic mismatches between images and instructions and to flag ambiguous or unanswerable prompts.
Cleaning operates in multiple stages to ensure systematic removal of flawed samples before CoT enrichment. Samples with text/image mismatches, duplicative responses, or formatting issues are rigorously pruned. The staged approach produces a corpus substantially cleaner and more relevant than prior open-source QA datasets.
A plausible implication is that the dataset’s rigorous filtering, disclosed at each stage, minimizes noise and enhances the utility of samples for complex compositional learning.
3. Chain-of-Thought Enrichment Strategy
Chain-of-thought enrichment is central to Honey-Data-15M. Samples are programmatically augmented with either short or long CoT explanations according to instruction complexity:
- Short CoT: These explanations provide a concise, explicit logical path (“think” blocks) to the direct answer, facilitating stepwise reasoning and decompositional learning.
- Long CoT: Complex tasks require detailed, multi-turn logical chains, typically set off by explicit reasoning delimiters (think-style markers), supporting extended reasoning and solution explanation.
This enrichment method annotates both direct-answer and multi-hop reasoning cases, ensuring coverage of generic and advanced instruction-following capabilities. Empirical evaluations indicate significant gains in model performance on benchmarks requiring multi-step reasoning and factual accuracy.
The distinction between short and long CoT is algorithmically determined, targeting only samples whose semantic structure or anticipated difficulty justifies deeper annotation.
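The routing decision can be illustrated with a toy heuristic. The real pipeline's determination is algorithmic and model-informed; the keyword cues and domain sets below are hypothetical stand-ins that only show the shape of the decision:

```python
def choose_cot_level(instruction: str, domain: str) -> str:
    """Toy router for short vs. long CoT enrichment (illustrative heuristic)."""
    hard_cues = ("prove", "derive", "step by step", "compare", "why")
    hard_domains = {"stem", "chart"}  # assumed harder-domain set
    text = instruction.lower()
    # Route to long CoT when the domain or phrasing suggests multi-hop reasoning.
    if domain in hard_domains or any(cue in text for cue in hard_cues):
        return "long"
    return "short"
```

Under this sketch, a simple recognition query receives a short CoT, while a derivation-style prompt in a STEM domain is routed to long CoT.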
4. HoneyPipe and DataStudio: Curation and Transparency
All curation procedures are automated through HoneyPipe, a modular pipeline developed atop DataStudio. DataStudio exposes each cleaning and enrichment operation, enabling researchers to audit, modify, or extend every aspect of dataset production.
Unlike static dataset releases, HoneyPipe and DataStudio support reproducible generation and versioning, with full transparency over applied operators. Each process—rule-based filtering, model-based semantic validation, dual-level chain-of-thought augmentation, and fidelity checks—is openly documented.
This curation infrastructure allows future users to dynamically regenerate or extend the dataset to new domains, apply alternative enrichment strategies, or tailor sampling to specific research goals. This suggests an emerging paradigm in open-corpus methodology where transparency and adaptability supersede one-off dataset publication.
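The operator-chaining design described above can be sketched as composable generators over a sample stream. Function names and the dict schema are hypothetical; DataStudio's real API is not specified in this summary:

```python
def run_pipeline(samples, operators):
    """Apply curation operators in sequence, each consuming the prior's output."""
    stream = iter(samples)
    for op in operators:
        stream = op(stream)  # every operator is auditable and swappable
    return list(stream)

def drop_empty_response(stream):
    for s in stream:
        if s.get("response", "").strip():
            yield s

def dedupe_by_instruction(stream):
    seen = set()
    for s in stream:
        if s["instruction"] not in seen:
            seen.add(s["instruction"])
            yield s

cleaned = run_pipeline(
    [{"instruction": "Q1", "response": "A"},
     {"instruction": "Q1", "response": "A"},   # duplicate, removed
     {"instruction": "Q2", "response": ""}],   # empty response, removed
    [drop_empty_response, dedupe_by_instruction],
)
```

Because each operator is an independent function, a researcher can insert, replace, or reorder stages and regenerate the corpus, which is the transparency property the section emphasizes.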
5. Model Training Protocols and Benchmarking
To demonstrate the efficacy of Honey-Data-15M, the authors conduct a five-stage training sequence on the Bee-8B MLLM. The sequence is as follows:
- MLP Warmup: An MLP projector aligns visual and language features using ~1M curated image–caption pairs (source: LLaVA-OneVision and recaptioned COYO subsets). Feature mapping utilizes a two-layer MLP with GELU activation.
- Vision–Language Alignment: All model weights are unfrozen and trained across ~12.6M paired samples and additional language-only data, maintaining general linguistic competence.
- Multimodal SFT: The complete Honey-Data-15M dataset is used for supervised fine-tuning, emphasizing instruction-following and CoT reasoning skills.
- Efficient Refinement SFT: A 1M-sample subset, algorithmically selected for topic balance and a strict short vs. long CoT ratio, is used for high-fidelity, efficient final tuning.
- Reinforcement Learning with GRPO: The Group Relative Policy Optimization (GRPO) algorithm is applied, guided by rule-based reward functions that penalize repetition and improve output formatting. A representative formula for arc-length reasoning tasks is $s = \int_a^b \sqrt{1 + \left(f'(x)\right)^2}\,dx$.
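GRPO's core credit-assignment step, in its standard formulation, normalizes each rollout's reward within its sampling group, $A_i = (r_i - \mathrm{mean}(r)) / \mathrm{std}(r)$. The sketch below shows that computation only; it is not the paper's exact implementation, and the reward values are placeholders:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one rollout group (standard GRPO formulation)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Four rollouts for the same prompt, scored by rule-based rewards (placeholder values).
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are relative within the group, no separate value network is needed, which is the main practical appeal of GRPO for this final stage.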
Ablation studies document improvements at each stage, with Bee-8B establishing a new state of the art among fully open MLLMs and matching or surpassing semi-open models such as InternVL3.5-8B on factual and reasoning benchmarks.
6. Impact and Paradigmatic Advances in Open MLLMs
Honey-Data-15M, along with HoneyPipe and DataStudio, marks a substantial step toward bridging the quality gap between fully open and proprietary MLLMs. The comprehensive data cleaning, domain-adaptive CoT enrichment, and reproducible curation pipeline set new standards for dataset quality.
Impact dimensions include:
- Enabling Advanced Reasoning: The dual-level CoT enrichment allows models not only to generate direct answers but also to internalize logical reasoning approaches.
- Methodological Transparency: Every curation and training step is documented, supporting reproducibility and iterative model evolution.
- Empowering the Community: Release of the corpus, curation suite, training recipes, evaluation harnesses, and model weights democratizes frontier MLLM development.
The emphasis on dataset quality—rather than sheer scale—demonstrates empirically that transparent, principled curation unlocks competitive performance for open models. A plausible implication is that future open-source MLLMs will increasingly rely on iterative, pipeline-based curation and structured reasoning enrichment to maintain parity with closed or semi-open systems.
7. Technical Specifics and Representative Formulae
Technical features include model-based semantic filters (Qwen2.5-VL-72B), a two-layer MLP feature projector with GELU activation for vision–language alignment, and GRPO for reinforcement-learning fine-tuning. Representative mathematical notation used in reasoning tasks, such as arc-length computation, is
$$s = \int_a^b \sqrt{1 + \left(f'(x)\right)^2}\,dx,$$
which illustrates the dataset's support for precise logical and mathematical reasoning in multimodal contexts.
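As a generic worked example of the arc-length computation (not drawn from the dataset itself), a chord-sum approximation of $s = \int_a^b \sqrt{1 + f'(x)^2}\,dx$ converges to the closed form for a curve with a known answer:

```python
import math

def arc_length(f, a, b, n=100_000):
    """Approximate the arc length of y = f(x) on [a, b] by summing chord lengths."""
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    ys = [f(x) for x in xs]
    return sum(math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) for i in range(n))

# y = x**1.5 on [0, 1] has closed-form arc length (8/27) * ((13/4)**1.5 - 1).
approx = arc_length(lambda x: x ** 1.5, 0.0, 1.0)
exact = (8 / 27) * ((13 / 4) ** 1.5 - 1)
```

This is exactly the kind of stepwise numeric-symbolic check a long-CoT sample is meant to teach a model to carry out.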
Summary
Honey-Data-15M is a large-scale, meticulously curated multimodal QA dataset with dual-level chain-of-thought augmentation and transparent, reproducible curation via HoneyPipe and DataStudio. Its application in the training of Bee-8B substantiates its quality, producing fully open MLLMs that approximate or exceed the capabilities of leading semi-open alternatives. The corpus and pipeline represent a paradigm shift away from static data releases toward dynamic, auditable methodologies that prioritize reasoning depth and data fidelity as the leading criteria for model advancement (Zhang et al., 15 Oct 2025).