
Bee-8B: Open 8B Multimodal LLM

Updated 16 October 2025
  • Bee-8B is a fully open-source 8-billion-parameter multimodal language model that uses a multi-stage training regime with Chain-of-Thought enrichment.
  • It leverages the Honey-Data-15M dataset, featuring 15M multimodal samples with dual-level CoT annotations and rigorous noise filtering for high-quality data.
  • Evaluated on extensive vision-language and reasoning benchmarks, Bee-8B demonstrates state-of-the-art performance and provides a reproducible framework for further research.

Bee-8B is a fully open 8-billion-parameter multimodal LLM (MLLM) developed to advance open-source vision-language modeling. It was trained on the Honey-Data-15M corpus, a 15-million-sample, rigorously curated multimodal supervised fine-tuning (SFT) dataset featuring dual-level Chain-of-Thought (CoT) enrichment and extensive noise filtering. Bee-8B follows a multi-stage training regime with architectural refinements and is evaluated across a comprehensive battery of vision-language and reasoning benchmarks. Its release, alongside the HoneyPipe/DataStudio curation stack and open model weights, targets the persistent data-quality gap in non-proprietary MLLM development and provides the community with a reproducible framework for high-quality, scalable multimodal modeling (Zhang et al., 15 Oct 2025).

1. Honey-Data-15M: Dataset Characteristics and Innovations

The Honey-Data-15M dataset serves as the backbone for Bee-8B. It comprises approximately 15M multimodal QA samples drawn from diverse domains, including general VQA, document OCR, chart/table understanding, STEM Q&A, and mathematical reasoning. A notable methodological contribution is the “dual-level Chain-of-Thought (CoT) enrichment”: samples identified as suitable for concise explanations are labeled with “short CoT” reasoning paths (12.2M samples), while those requiring extended reasoning undergo “long CoT” annotation (2.7M samples).

Data curation within HoneyPipe and DataStudio includes:

  • Rule-based filtering: Prunes samples with format errors, low-resolution visuals, text repetition, or semantic mismatches between the image and instruction.
  • Model-based verification: Uses a strong vision-LLM as an “LLM-as-a-Judge” to filter out ambiguous or inconsistent examples.
  • Fidelity assurance: Checks that enriched CoT answers match the expected factual result; only consistent samples are retained.

This results in a training set with minimized noise and enriched annotation depth, tailored for advanced multimodal and reasoning capabilities.
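The rule-based and model-based passes compose into a simple cascade. The following is a minimal sketch of how such a cascade could look in practice; the helper names, thresholds, and judge interface are illustrative assumptions, not the released HoneyPipe code:

```python
# Minimal sketch of a rule-based + model-based filtering cascade.
# Function names and thresholds are illustrative, not the released HoneyPipe API.

from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    question: str
    answer: str
    image_width: int
    image_height: int

MIN_RESOLUTION = 224  # assumed threshold for pruning low-resolution visuals

def passes_rule_filters(s: Sample) -> bool:
    """Cheap, deterministic checks applied before any model-based pass."""
    if min(s.image_width, s.image_height) < MIN_RESOLUTION:
        return False
    if not s.question.strip() or not s.answer.strip():
        return False  # format error: empty question or answer
    words = s.answer.split()
    if len(words) > 20 and len(set(words)) / len(words) < 0.3:
        return False  # crude heuristic for text repetition
    return True

def passes_judge(s: Sample, judge) -> bool:
    """LLM-as-a-judge pass: ask a strong vision-LLM whether the QA pair is
    consistent with the image; `judge` is any callable returning 'yes'/'no'."""
    verdict = judge(
        image=s.image_path,
        prompt=f"Q: {s.question}\nA: {s.answer}\nIs the answer consistent with the image?",
    )
    return verdict.strip().lower().startswith("yes")

def curate(samples, judge):
    """Apply the rule-based filters first, then the model-based verification."""
    kept = [s for s in samples if passes_rule_filters(s)]
    return [s for s in kept if passes_judge(s, judge)]
```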

2. HoneyPipe and DataStudio: Transparent Data Curation Pipeline

The HoneyPipe data curation pipeline, built on the modular DataStudio framework, moves beyond static dataset releases by enabling systematic, reproducible, and adaptive collection and filtering at scale. Its multi-step logic includes:

  • Initial aggregation: Compiles raw, unfiltered samples from diverse sources.
  • Progressive filtering: Applies cascading rule-based and model-based operators to remove noise and ensure semantic fidelity.
  • Stage-wise CoT enrichment: First attempts short CoT generation; if fidelity fails, a higher-capacity model generates a long CoT, with each CoT checked for consistency against the original answer.
  • Dynamic annotation depth: Samples are routed through enrichment phases based on the complexity of the underlying task.

The pipeline is fully open, with source code and recipes provided, enabling rigorous reproduction and continued refinement by the research community.
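The short-to-long CoT fallback can be illustrated with a small control-flow sketch. The generator and verifier interfaces below are assumptions for clarity, not the published DataStudio API:

```python
# Illustrative control flow for stage-wise CoT enrichment.
# `short_cot_model`, `long_cot_model`, and `verify_fidelity` are placeholders
# for whatever generators and consistency checkers a pipeline plugs in.

def enrich_with_cot(sample, short_cot_model, long_cot_model, verify_fidelity):
    """Return (cot_text, level) for a faithful CoT, or None if enrichment fails."""
    # First attempt: a concise reasoning path ("short CoT").
    short_cot = short_cot_model(sample["question"], sample["image"])
    if verify_fidelity(short_cot, sample["answer"]):
        return short_cot, "short"

    # Fallback: a higher-capacity model produces an extended trace ("long CoT").
    long_cot = long_cot_model(sample["question"], sample["image"])
    if verify_fidelity(long_cot, sample["answer"]):
        return long_cot, "long"

    # Neither CoT reproduces the reference answer: drop the sample.
    return None
```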

3. Bee-8B Training Regime: Multi-stage Framework and Optimization Strategies

Bee-8B is built upon a five-stage training regimen designed to maximize multimodal alignment and reasoning performance:

  1. MLP Warmup: Trains only the MLP projector mapping vision encoder outputs to language embeddings, leveraging a subset of vision-language datasets.
  2. Vision-Language Alignment: Joint, unfrozen training of the entire stack on 12.6M multimodal pairs and substantial text-only reasoning data (including Nemotron samples), establishing cross-modal alignment.
  3. Multimodal SFT: Fine-tuning on the full Honey-Data-15M with both short and long CoT, imparting robust instruction-following and detailed step-by-step reasoning.
  4. Efficient Refinement SFT: A curated 1M-sample subset, quota-based for balanced topic distribution and dialog length, is used for further refinement.
  5. Reinforcement Learning (GRPO): Final policy optimization applies a weighted reward that enforces format requirements (such as enclosing the final answer in $\boxed{\text{answer}}$) and verifies factual accuracy against a ground-truth answer.

This staged methodology systematically optimizes both generalization and task-specific performance, with each stage contributing incremental gains documented in detailed ablation studies.
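A minimal sketch of how the weighted reward in the GRPO stage could be composed is given below; the 0.1/0.9 weighting, the $\boxed{\cdot}$ parsing, and the exact-match normalization are assumptions for illustration, not Bee-8B's published reward specification:

```python
import re

# Illustrative GRPO-style reward combining a format check and an accuracy check.
# The 0.1 / 0.9 weights and the \boxed{...} parsing rule are assumptions for this sketch.

BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion encloses a final answer in \\boxed{...}, else 0.0."""
    return 1.0 if BOXED_RE.search(completion) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the boxed answer matches the reference after simple normalization."""
    match = BOXED_RE.search(completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

def weighted_reward(completion: str, ground_truth: str,
                    w_format: float = 0.1, w_accuracy: float = 0.9) -> float:
    return (w_format * format_reward(completion)
            + w_accuracy * accuracy_reward(completion, ground_truth))

# Example: a completion ending in \boxed{42} scored against the reference "42".
print(weighted_reward(r"... therefore the result is \boxed{42}", "42"))  # 1.0
```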

4. Benchmark Evaluation and Comparative Performance

Bee-8B is assessed using VLMEvalKit across multiple domains:

  • General VQA: Reported results show Bee-8B matching or outperforming semi-open models (e.g., InternVL3.5-8B) in top-1 accuracy and reasoning tasks.
  • Document OCR/Structured Data Tasks: Benchmark results evidence strong aptitude in reading and reasoning over structured information.
  • Math and STEM Reasoning: Evaluations highlight superior CoT reasoning as a direct consequence of data enrichment and fine-tuning.

Bee-8B is evaluated under two inference modes: non-thinking (deterministic decoding with concise output) and thinking (higher-temperature sampling with longer outputs for detailed reasoning). Progression from raw to enriched data and through all five training stages is visualized in radar plots and ablation tables, demonstrating consistent improvements in accuracy, reasoning ability, and robustness.
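In practice, the two modes reduce to different decoding settings. The sketch below illustrates this with generic Hugging Face-style generation parameters; the specific temperature, top-p, and length values are assumptions rather than Bee-8B's published configuration:

```python
# Generic decoding configurations illustrating the two inference modes.
# The concrete values are assumptions, not Bee-8B's released settings.

NON_THINKING = {
    "do_sample": False,      # deterministic (greedy) decoding
    "max_new_tokens": 512,   # concise, direct answers
}

THINKING = {
    "do_sample": True,
    "temperature": 0.6,      # higher temperature for exploratory reasoning
    "top_p": 0.95,
    "max_new_tokens": 4096,  # room for a long chain of thought before the final answer
}

def generate(model, tokenizer, prompt, mode="thinking"):
    """Run generation under one of the two modes (Hugging Face-style interface assumed)."""
    cfg = THINKING if mode == "thinking" else NON_THINKING
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **cfg)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```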

5. Contributions, Impact, and Open Science Infrastructure

Bee-8B delivers several foundational contributions:

  • State-of-the-Art Fully Open MLLM: It sets new SOTA results for fully open models on multiple public benchmarks.
  • Open Corpus and Pipeline: Honey-Data-15M and the full-stack curation methodology (HoneyPipe/DataStudio) are released—including code, recipes, and documentation.
  • Model Weights and Evaluation Harness: Researchers are provided with pre-trained weights and a customizable evaluation toolkit (VLMEvalKit adaptations) for transparent benchmarking.
  • Data-centric Paradigm Shift: The results demonstrate that high-quality, deeply filtered, CoT-enriched data, rather than raw volume, is the critical driver for closing the gap between open-source and proprietary models.

This infrastructure positions Bee-8B as a reproducible foundation for further research, experimentation, and downstream adaptation in multimodal tasks.

6. Technical Features and Illustrative Reasoning

Explicit mathematical reasoning is exemplified in the model outputs and ablation studies. For example, the arc length formula,

s = \frac{\theta}{360^\circ} \times 2\pi r

is used as a demonstrative instance in chain-of-thought enriched samples, illustrating Bee-8B’s capacity for accurate multi-step calculation and instruction-following—attributes reflected in the high-level performance statistics reported across reasoning and STEM evaluations.
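A CoT-style worked instance of this formula, with values chosen here purely for illustration (a central angle of 60° and radius 3) rather than drawn from a specific Honey-Data-15M sample, proceeds as:

s = \frac{60^\circ}{360^\circ} \times 2\pi \times 3 = \frac{1}{6} \times 6\pi = \pi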

7. Future Directions and Research Opportunities

The Bee-8B release highlights new avenues for open MLLM research:

  • Data enrichment expansion: Further refinement and scaling of dual-level CoT techniques may continue to yield gains, particularly for advanced composite reasoning tasks.
  • Adaptation and scaling: The transparent pipeline is readily extendable to new domains, modalities, and languages.
  • Open evaluation ecosystem: Community collaboration around benchmarking and data curation will foster increasingly robust and accountable open-source alternatives to proprietary MLLMs.

Bee-8B, together with the supporting open science infrastructure, is poised to catalyze further advances in high-quality, fully open multimodal AI systems for both academic and industrial application.
