PCMind-2.1-Kaiyuan-2B Technical Report (2512.07612v1)

Published 8 Dec 2025 in cs.CL, cs.AI, and cs.LG

Abstract: The rapid advancement of LLMs has resulted in a significant knowledge gap between the open-source community and industry, primarily because the latter relies on closed-source, high-quality data and training recipes. To address this, we introduce PCMind-2.1-Kaiyuan-2B, a fully open-source 2-billion-parameter model focused on improving training efficiency and effectiveness under resource constraints. Our methodology includes three key innovations: a Quantile Data Benchmarking method for systematically comparing heterogeneous open-source datasets and providing insights on data mixing strategies; a Strategic Selective Repetition scheme within a multi-phase paradigm to effectively leverage sparse, high-quality data; and a Multi-Domain Curriculum Training policy that orders samples by quality. Supported by a highly optimized data preprocessing pipeline and architectural modifications for FP16 stability, Kaiyuan-2B achieves performance competitive with state-of-the-art fully open-source models, demonstrating practical and scalable solutions for resource-limited pretraining. We release all assets (including model weights, data, and code) under Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B.

Summary

  • The paper introduces a fully open-source 2B-parameter LLM pretraining pipeline that integrates quantile data benchmarking and multi-phase quality sampling.
  • It details a Spark-based processing pipeline with C++ acceleration to ensure numerical stability and reproducibility under constrained hardware.
  • The work demonstrates improved benchmark performance and practical efficiency for resource-limited LLM research compared to similar open-weight models.

PCMind-2.1-Kaiyuan-2B: A Fully Open-Source, Resource-Efficient LLM Pretraining Pipeline

Motivation and Scope

The technical report "PCMind-2.1-Kaiyuan-2B Technical Report" (2512.07612) presents a fully open-source 2B-parameter LLM, detailing both the model release and a transparent training methodology. The authors explicitly target the knowledge gap between the proprietary practices of industry-scale LLMs and the limited-resource, open-weight initiatives in academia. All assets, including model weights, datasets, and code, are released under Apache 2.0, prioritizing genuine reproducibility and unrestricted research use.

Central challenges addressed include: (1) systematically comparing and mixing heterogeneous open-source pretraining corpora amidst drastic feature, quality, and label variations, and (2) devising data-efficient methods that maximize the utility of sparse, high-quality data under compute and token constraints. The work emphasizes practical solutions for the broader LLM research ecosystem, especially those constrained to limited clusters and commodity hardware.

Quantile Data Benchmarking: Dataset Comparison with Quality Granularity

To inform curriculum and mixture strategy, the authors introduce "Quantile Data Benchmarking" as a systematic, empirical layer over rule-based dataset curation. Given that open datasets (e.g., DCLM-Baseline, FineWeb-Edu) often provide rule-derived or classifier-based quality scores, this approach benchmarks slices of each dataset stratified by quality-score quantiles, rather than applying simple global filtering. Reference models are trained and evaluated on subsets near selected quantiles, quantifying true utility along the score spectrum for various downstream tasks (Figure 1).

Figure 1: Illustration of the quantile benchmarking process used to probe dataset granularity and guide data selection/mixing policies.
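To make the procedure concrete, the sketch below illustrates one way such quantile-stratified benchmarking could be implemented, assuming each corpus ships a per-document quality score: documents near selected quantiles are sliced out and a small reference model is trained on each slice under a fixed token budget. The slice widths, quantile choices, and the train_ref_model/evaluate helpers are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of quantile data benchmarking (not the authors' exact pipeline).
# Assumes each corpus provides a per-document quality score; quantile choices,
# slice width, and the train/eval helpers are illustrative.
import numpy as np

def quantile_slices(scores, quantiles=(0.1, 0.3, 0.5, 0.7, 0.9), width=0.05):
    """Return index sets of documents whose score falls near each quantile."""
    scores = np.asarray(scores, dtype=float)
    slices = {}
    for q in quantiles:
        lo = np.quantile(scores, max(q - width, 0.0))
        hi = np.quantile(scores, min(q + width, 1.0))
        slices[q] = np.flatnonzero((scores >= lo) & (scores <= hi))
    return slices

def benchmark_dataset(docs, scores, train_ref_model, evaluate, token_budget):
    """Train a small reference model per quantile slice and record task scores."""
    results = {}
    for q, idx in quantile_slices(scores).items():
        subset = [docs[i] for i in idx]
        model = train_ref_model(subset, token_budget=token_budget)  # fixed budget per slice
        results[q] = evaluate(model)  # e.g. {"mmlu": ..., "winogrande": ...}
    return results
```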

Empirically, this approach reveals marked internal heterogeneity within datasets and capability-dependent superiority (e.g., FineWeb-Edu dominates in knowledge-intensive tasks such as MMLU, while DCLM-Baseline yields higher performance on commonsense/factual benchmarks like WinoGrande), as shown in Figure 2.


Figure 2: Task-dependent dataset characteristics discovered by quantile benchmarking; FineWeb-Edu is optimal for knowledge benchmarks, while DCLM-Baseline excels at commonsense.

A non-monotonic relationship between quality score and final-task performance is also observed, cautioning against naive “higher is always better” sample selection. These results underscore the inadequacy of top-k filtering or reliance on a single quality metric for open-source LLM data curation.

Data Processing Infrastructure and Architectural Stability

Handling the scale and operational diversity of open-source datasets requires robust preprocessing workflows. The report details a Spark-based data processing pipeline (Kai), augmented by native C++ acceleration via Chukonu. This architecture provides fully configuration-driven (YAML-specified) pipeline reproducibility, with strong support for deduplication, interleaving, rank rescaling, and distributed execution.
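A minimal PySpark sketch of the kind of preprocessing the report describes is shown below, covering exact deduplication and per-dataset rank rescaling of quality scores. The column names, storage paths, and dedup strategy are assumptions for illustration and do not reflect the internals of Kai or Chukonu.

```python
# Illustrative PySpark preprocessing sketch; column names ("text", "quality_score",
# "source") and paths are assumptions, not the Kai/Chukonu implementation.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pretrain-preprocess").getOrCreate()

docs = spark.read.parquet("s3://bucket/raw_corpus/")  # placeholder path

# Exact deduplication on a content hash (fuzzy/MinHash dedup would be a further step).
deduped = (docs
           .withColumn("doc_hash", F.sha2(F.col("text"), 256))
           .dropDuplicates(["doc_hash"]))

# Rank-rescale quality scores to [0, 1] within each source dataset so that
# slices from different corpora become comparable before interleaving.
w = Window.partitionBy("source").orderBy(F.col("quality_score"))
ranked = deduped.withColumn("global_rank", F.percent_rank().over(w))

ranked.write.parquet("s3://bucket/processed_corpus/")
```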

Under hardware constraints—FP16 clusters with no bfloat16 support—the report carefully addresses numerical instability stemming from pre-norm and residual-connection accumulation in standard architectures. As the comparison of activation statistics shows, unmodified transformer variants suffer from activation explosion and gradient underflow (Figure 3).


Figure 3: Activation statistics highlight instability in the baseline, motivating FP16-specific architectural interventions.
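As an illustration of how such activation statistics can be gathered, the following PyTorch sketch attaches forward hooks that record per-layer peak activation magnitudes, flagging layers that approach the FP16 representable limit (about 6.5e4). The monitored module types and the threshold are assumptions, not the authors' diagnostic tooling.

```python
# Hedged sketch: forward hooks that track per-layer peak activation magnitude,
# useful for spotting FP16 overflow risk (FP16 max is roughly 65504).
import torch
import torch.nn as nn

def attach_activation_monitors(model):
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                stats[name] = max(stats.get(name, 0.0), output.abs().max().item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.LayerNorm)):  # assumed module types
            module.register_forward_hook(make_hook(name))
    return stats  # inspect after a few forward passes; values near 6.5e4 risk FP16 overflow
```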

To ensure stable training, the model stack incorporates sandwich normalization and logit soft-capping, following recent best practices demonstrated in, e.g., Gemma 2. The result is reliable convergence without numerical overflow or underflow across curriculum domains (language, code, math).
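The sketch below shows one common way to realize these two stabilizers in PyTorch: a residual block normalized both before and after its sub-layer (sandwich normalization) and a tanh-based soft cap on logits. The choice of RMSNorm, the FP32 accumulation inside the norm, and the cap value of 30.0 are assumptions rather than the exact Kaiyuan-2B configuration.

```python
# Sketch of the two FP16 stabilizers named in the report: sandwich normalization
# and logit soft-capping. Norm type, dimensions, and cap value are assumptions.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Accumulate in FP32 to avoid FP16 overflow/underflow inside the norm.
        x32 = x.float()
        x32 = x32 * torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + self.eps)
        return (x32 * self.weight).type_as(x)

class SandwichBlock(nn.Module):
    """Residual sub-layer wrapped in pre- and post-normalization."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.pre_norm = RMSNorm(dim)
        self.post_norm = RMSNorm(dim)
        self.sublayer = sublayer  # e.g. an attention or MLP module

    def forward(self, x):
        # Normalizing the branch output keeps residual magnitudes bounded in FP16.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

def soft_cap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap), Gemma 2-style soft-capping."""
    return cap * torch.tanh(logits / cap)
```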

Multi-Phase, Multi-Domain Curriculum and Strategic Repetition

The model’s core data efficiency gains arise from a phased curriculum. Training proceeds through five phases, transitioning from broad, lower-quality mixtures toward increasing proportions of high-quality, domain-specific, and supervised samples. Each phase involves both domain-level mixture adaptation (e.g., adjusting proportions of Chinese, code, math, and SFT data) and quality-based selective repetition, in which high-quality partitions are repeated more frequently in later phases (Figure 4).


Figure 4: Phase-wise data mixture transitions, showing the curriculum’s progression from broad coverage to focused, high-quality slices.
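The following toy sketch conveys the shape of strategic selective repetition: each quality tier receives a phase-dependent repetition count, so later phases replay high-quality partitions more often while dropping the lowest tier. The five-phase structure matches the report, but the tier boundaries and repetition factors are invented for illustration.

```python
# Illustrative selective-repetition schedule; the repetition counts below are
# made-up values, not the paper's recipe.

# REPETITION_SCHEDULE[phase][tier] = how many times that tier is replayed in that phase
REPETITION_SCHEDULE = {
    1: {"low": 1, "mid": 1, "high": 1},   # broad coverage, no favoritism
    2: {"low": 1, "mid": 1, "high": 2},
    3: {"low": 0, "mid": 1, "high": 2},
    4: {"low": 0, "mid": 1, "high": 3},
    5: {"low": 0, "mid": 0, "high": 4},   # final phase: only top-quality data, repeated
}

def phase_samples(partitions, phase):
    """Expand each quality-tier partition according to its repetition count."""
    out = []
    for tier, docs in partitions.items():
        out.extend(docs * REPETITION_SCHEDULE[phase].get(tier, 0))
    return out
```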

The multi-dataset curriculum is constructed via a principled rank-rescaling and interleaving process, guaranteeing that quality metrics are preserved across and within datasets, and that mixture ratios respect quantile benchmarking insights (Figure 5).

Figure 5: Data flow for curriculum construction—a globally consistent rank is assigned for mixing/shuffling across dataset boundaries.
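A simplified sketch of rank rescaling and interleaving is given below: each dataset's quality scores are mapped to an empirical percentile rank in [0, 1], samples are subsampled according to per-dataset mixture weights, and the merged pool is ordered by the common rank so the curriculum ends on the highest-quality data. The weighting scheme and ordering rule are assumptions about how such a construction could work, not the paper's exact procedure.

```python
# Toy sketch of rank rescaling and interleaving across datasets; mixture weights
# and the keep-probability subsampling are illustrative assumptions.
import numpy as np

def rank_rescale(scores):
    """Map raw quality scores to their empirical percentile rank in [0, 1]."""
    order = np.argsort(np.argsort(scores))
    return order / max(len(scores) - 1, 1)

def build_curriculum(datasets, weights, seed=0):
    """Merge datasets into one stream ordered by global rank.

    datasets: {name: (docs, quality_scores)}, weights: {name: keep probability}.
    """
    rng = np.random.default_rng(seed)
    pool = []
    for name, (docs, scores) in datasets.items():
        ranks = rank_rescale(np.asarray(scores, dtype=float))
        keep = rng.random(len(docs)) < weights[name]  # mixture weight as keep probability
        pool.extend((r, name, d) for d, r, k in zip(docs, ranks, keep) if k)
    # Low-rank samples early, high-rank late: the curriculum ends on the best data.
    pool.sort(key=lambda item: item[0])
    return pool
```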

This curriculum is tightly coupled to an LR schedule optimized for late-phase quality injection, with checkpoint averaging to consolidate gains from the highest-quality steps (Figure 6).

Figure 6: Training dynamics showing phase transitions, LR schedule, and shifts in loss and validation metrics over the curriculum.
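Checkpoint averaging itself is straightforward; the sketch below averages parameter tensors over a handful of late-phase checkpoints. The file paths and the number of checkpoints averaged are placeholders.

```python
# Minimal checkpoint-averaging sketch; assumes flat state dicts of tensors and
# placeholder file names.
import torch

def average_checkpoints(paths):
    """Average parameter tensors across several checkpoint files."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. merged = average_checkpoints(["ckpt_step_95000.pt", "ckpt_step_100000.pt"])
```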

Evaluation and Model Positioning

Benchmarks place PCMind-2.1-Kaiyuan-2B ahead of previous fully open-source models at similar parameter scales and close to top-tier open-weight models (Qwen2-1.5B, Gemma2-2B, Llama3.2-3B). Strong results are reported out of domain (Chinese, math, code) and on reasoning/knowledge benchmarks, especially when non-embedding parameter efficiency is taken into account (Figure 7).

Figure 7: Kaiyuan-2B’s performance advances the open-source model frontier and approaches open-weight baselines of similar size.


Figure 8: Non-embedding parameter comparison, highlighting architectural and efficiency advantages over open and semi-open competitors.

Notably, for tasks in code synthesis (HumanEval, MBPP), mathematical reasoning (MATH, GSM8K), and Chinese (C-Eval, CMMLU), the model outperforms SmolLM2/3-1.7B/3B and OLMo-2-1B, and approaches or exceeds YuLan-Mini-2.4B despite a considerably smaller data budget.

Practical and Theoretical Implications

Kaiyuan-2B demonstrates that transparent, phased curation, heterogeneity-aware benchmarking, and principled selective repetition offer concrete advances in compute-, data-, and resource-constrained LLM pretraining. The work rejects monolithic, undifferentiated open-data mixing, showing that task-driven benchmarking and dynamic curricula are required to match or approach the efficiency of industry-trained closed models.

Practically, the open-source pipeline enables reproducible, large-scale LLM research on moderate clusters, without access to restricted corpora or proprietary recipes. Theoretically, quantile benchmarking questions the validity of global quality metrics and sets a precedent for capability-specific, empirical dataset utility evaluation.

A natural avenue for future study includes more granular, quantitative frameworks for dataset mixing optimization, theoretical analysis of curriculum/repetition dynamics, and exploration of cross-lingual and low-resource domain adaptation.

Conclusion

PCMind-2.1-Kaiyuan-2B operationalizes a reproducible, fully open-source LLM training pipeline that advances state-of-the-art results at the 2B parameter scale using transparent, resource-efficient strategies. The explicit integration of quantile benchmarking, multi-phase quality-focused sampling, and robust normalization ensures both competitive benchmark performance and credible community reproducibility across computational environments. The model and training pipeline serve as a reference for the next generation of academic LLM exploration and resource-constrained LLM engineering.

Open Problems

We found no open problems mentioned in this paper.
