
PCMind-2.1-Kaiyuan-2B: Open 2B LLM

Updated 10 December 2025
  • PCMind-2.1-Kaiyuan-2B is a fully open-source large language model with 2.03B parameters, built on a decoder-only Transformer architecture from the LLaMA family.
  • It integrates innovative methods such as quantile data benchmarking, strategic selective repetition, and multi-domain curriculum training to optimize resource-constrained pretraining.
  • The model employs architectural refinements like QK-Norm and sandwich normalization to stabilize FP16 training and ensure robust convergence on diverse open datasets.

PCMind-2.1-Kaiyuan-2B is a fully open-source LLM with approximately 2.03 billion parameters, designed to advance the open community's capabilities in efficient and effective pretraining under resource constraints. Developed within the LLaMA family as a decoder-only Transformer, this model introduces and integrates several methodological innovations, including quantile data benchmarking, strategic selective repetition in a multi-phase paradigm, multi-domain curriculum training, and robust preprocessing optimizations for both data and architecture. Its open release includes weights, code, processed datasets, and comprehensive training recipes, enabling transparent reproduction and extension by the research community (Luo et al., 8 Dec 2025).

1. Architectural Design and Parameterization

PCMind-2.1-Kaiyuan-2B utilizes a Transformer architecture closely aligned with the LLaMA family. The configuration comprises 28 layers, a hidden size of 2048, an MLP size of 6144, 16 attention heads (8 key/value heads), rotary position embeddings with $\theta = 10000$, and a vocabulary of 151,936 tokens. Input and output embeddings are untied, with embedding parameters totaling 0.62 billion and non-embedding parameters within the Transformer blocks comprising approximately 1.41 billion, yielding about 2.03 billion total parameters.
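
As a quick consistency check (not taken from the released code), the untied input and output embedding matrices alone account for the reported 0.62 billion embedding parameters:

```python
# Worked check of the embedding parameter count implied by the configuration above:
# untied input and output embeddings, vocabulary of 151,936 tokens, hidden size 2048.
vocab_size, hidden_size = 151_936, 2_048
embedding_params = 2 * vocab_size * hidden_size       # separate input and output matrices
print(f"embedding params ~ {embedding_params / 1e9:.2f}B")   # ~0.62B, as reported
```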

To enhance FP16 training stability, the model architecture incorporates several interventions:

  • QK-Norm: Applied to each attention head to bound query-key dot-products and prevent instability.
  • Logits Soft-Capping: Any logit $x$ before the softmax or the final LM head is transformed as $x' = \sigma \cdot \tanh(x/\sigma)$ with $\sigma = 30.0$, ensuring bounded output values.
  • Sandwich Normalization: Residual blocks are structured as $x_{l+1} = x_l + \mathrm{Norm}_{\text{post}}(F(\mathrm{Norm}_{\text{pre}}(x_l)))$, where $F$ denotes self-attention or the MLP. This modification reduces the $L_1$ norm of internal activations by an order of magnitude.

These architectural innovations directly address common failure modes and instabilities arising during reduced-precision training, facilitating robust convergence on open datasets (Luo et al., 8 Dec 2025).
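
A minimal PyTorch-style sketch of these three interventions is given below; it assumes a recent PyTorch with `nn.RMSNorm`, and the module and function names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_cap(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    """Logits soft-capping: x' = cap * tanh(x / cap) bounds values to (-cap, cap)."""
    return cap * torch.tanh(logits / cap)

class SandwichBlock(nn.Module):
    """Residual block with pre- and post-normalization around a sublayer F."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm_pre = nn.RMSNorm(dim)
        self.norm_post = nn.RMSNorm(dim)
        self.sublayer = sublayer  # self-attention or MLP

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_{l+1} = x_l + Norm_post(F(Norm_pre(x_l)))
        return x + self.norm_post(self.sublayer(self.norm_pre(x)))

def qk_norm_attention(q, k, v, q_norm: nn.Module, k_norm: nn.Module):
    """QK-Norm plus soft-capped attention logits before the softmax."""
    q, k = q_norm(q), k_norm(k)                       # bound query-key dot products
    scores = soft_cap(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```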

2. Quantile Data Benchmarking Methodology

Quantile Data Benchmarking enables systematic evaluation and comparison of heterogeneous open-source datasets by aligning them along quantiles of an intrinsic quality metric $s(x)$. For a dataset $D$, the $p$-th quantile threshold $s_p$ satisfies $\Pr_{x \sim D}[s(x) \geq s_p] = p/100$. Probing subsets $D_p = \{x \in D \mid s(x) \geq s_p\}$ are constructed, each truncated to a fixed token budget $T \approx 10$ billion tokens.

Reference models with $\leq 0.6$ billion parameters are trained on each $D_p$, either from scratch or via continual learning, and their downstream performance is measured on held-out benchmarks such as MMLU and WinoGrande, yielding a non-monotonic performance curve $\operatorname{Perf}(p)$.
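
As an illustration of how a probing subset $D_p$ might be materialized, a minimal sketch follows; the field names (`quality_score`, `num_tokens`) and the in-memory representation are assumptions, not the paper's pipeline.

```python
import numpy as np

def build_probing_subset(samples: list[dict], p: float,
                         token_budget: int = 10_000_000_000) -> list[dict]:
    """Select D_p = {x : s(x) >= s_p}, then truncate to a fixed token budget T."""
    scores = np.array([s["quality_score"] for s in samples])
    # Threshold s_p with Pr[s(x) >= s_p] = p / 100, i.e. the (1 - p/100) quantile.
    s_p = np.quantile(scores, 1.0 - p / 100.0)
    kept, used_tokens = [], 0
    for sample in sorted(samples, key=lambda s: s["quality_score"], reverse=True):
        if sample["quality_score"] < s_p or used_tokens >= token_budget:
            break
        kept.append(sample)
        used_tokens += sample["num_tokens"]
    return kept
```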

Empirical results indicate, for example, that FineWeb-Edu eclipses DCLM-Baseline on knowledge-intensive tasks, while DCLM-Baseline is superior for commonsense tasks. Such analyses guide the strategic mixing of datasets; for instance, the top 33.4% of DCLM-Baseline is found to be on par with the entire FineWeb-Edu on MMLU, informing its use in specific pretraining phases (Luo et al., 8 Dec 2025).

3. Strategic Selective Repetition and Multi-Phase Scheduling

Pretraining is divided into five distinct phases, each with a prescribed mixture of data domains and adaptive sample repetition based on quantile rank. In phase $i$, a sample $x$ from dataset $D_j$ is retained if its quality score $s_j(x)$ ranks within the top $k_{i,j}$ percent, and is repeated $r_{i,j} = \lceil \text{actual\_ratio} \rceil$ times. Low-quality data is included only once in the early phases, while high-quality samples are amplified in later phases, improving representation without excessive token-budget inflation.

A pseudocode outline governs selection and repetition, with explicit curriculum shuffling prior to each phase’s training. Empirical evidence, such as in Table 8, shows that repeating the top 33.4% of DCLM-Baseline over three epochs yields a +0.4% average gain on reasoning benchmarks compared to one-pass uniform sampling. This phased strategy thus maximizes data utility, especially for high-value examples under stringent resource limitations (Luo et al., 8 Dec 2025).
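
The selection-and-repetition logic might be rendered as the following sketch; the per-phase thresholds $k_{i,j}$ and repetition counts $r_{i,j}$ are supplied as inputs, and the per-sample `quality_percentile` field is an assumption rather than the paper's exact pseudocode.

```python
import random

def build_phase_corpus(datasets: dict[str, list[dict]],
                       top_k_pct: dict[str, float],
                       repeats: dict[str, int],
                       seed: int = 0) -> list[dict]:
    """Phase i: keep the top k_{i,j}% of each dataset D_j, repeat r_{i,j} times, shuffle."""
    corpus = []
    for name, samples in datasets.items():
        for sample in samples:
            # quality_percentile: 0 = best-ranked sample within its dataset (assumed field)
            if sample["quality_percentile"] <= top_k_pct[name]:
                corpus.extend([sample] * repeats[name])
    random.Random(seed).shuffle(corpus)   # shuffling prior to this phase's training
    return corpus
```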

4. Multi-Domain Curriculum Training Approach

The curriculum framework operates at the instance level across $K$ data domains. Within each domain $D_i$, examples are ranked by the quality metric $s_i(x)$ (or randomized for domains lacking scores), establishing a progression from “easy” to “hard”. A global rank rescaling is performed:

$R(x) = r_i(x) \cdot \frac{\sum_k N_k}{N_i}$

where $r_i(x)$ is the within-domain rank and $N_i$ is the dataset size. All domains are then merged and sorted by $R(x)$, ensuring that (1) mixture ratios across domains are preserved, (2) each domain’s curriculum is respected, and (3) interleaved training proceeds across the entire multi-domain dataset. In practice, this curriculum is enacted during phases 3–5, using a final learning rate of $6 \times 10^{-4}$ and checkpoint averaging for evaluation (Luo et al., 8 Dec 2025).
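
A minimal sketch of the rescale-and-merge step, assuming each domain's sample list is already sorted from easy to hard (function and variable names are illustrative):

```python
def merge_multi_domain_curriculum(domains: dict[str, list]) -> list[tuple[str, object]]:
    """Merge per-domain curricula by the global rescaled rank R(x) = r_i(x) * (sum_k N_k) / N_i."""
    total = sum(len(samples) for samples in domains.values())
    merged = []
    for name, samples in domains.items():
        n_i = len(samples)
        for rank, sample in enumerate(samples, start=1):    # r_i(x): within-domain rank
            merged.append((rank * total / n_i, name, sample))
    merged.sort(key=lambda item: item[0])                    # global easy-to-hard order
    return [(name, sample) for _, name, sample in merged]
```

Because each domain's ranks are stretched onto the same global scale before sorting, domains interleave in proportion to their sizes while their internal easy-to-hard ordering is preserved.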

5. Data Preprocessing Pipeline and Pipeline Optimizations

The data preprocessing workflow is centered on “Kai”, a Spark-based, tree-structured, YAML-configurable pipeline. Datasets are modeled as leaves and operators—such as filtering, deduplication, sampling, and mixing—as internal nodes.

For efficiency, computationally intensive operators, notably MinHash-based deduplication, are ported to high-performance C++ tasks using the Chukonu library, accelerating execution by approximately $2.5\times$ over native Spark. The overall pipeline processes raw sources through format unification, quality tagging, deduplication, quantile filtering, stochastic sampling, multi-domain mixing, curriculum sorting, and TFRecord output. The pipeline design ensures strict reproducibility: a given configuration always yields the identical processed dataset, facilitating open research and ablation studies (Luo et al., 8 Dec 2025).
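
The released pipeline is Spark-based and YAML-configured; purely as a conceptual illustration of the tree structure (datasets as leaves, operators as internal nodes), it could be modeled as below. All names here are invented for exposition and do not correspond to Kai's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

@dataclass
class PipelineNode:
    """Leaf nodes yield records from a dataset; operator nodes transform their children."""
    name: str
    source: Optional[Iterable[dict]] = None                          # set on leaves
    op: Optional[Callable[[Iterable[dict]], Iterable[dict]]] = None  # set on operators
    children: list["PipelineNode"] = field(default_factory=list)

    def run(self) -> Iterable[dict]:
        if self.source is not None:                   # dataset leaf
            return self.source
        merged = (rec for child in self.children for rec in child.run())
        return self.op(merged)                        # filter / dedup / sample / mix
```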

6. Experimental Evaluation and Baseline Comparison

Evaluation is conducted using OpenCompass and OLMES for PPL-based benchmarks, following standardized multi-shot protocols per dataset. Key open-weight comparison models include Qwen2, Gemma2-2B, and Llama 3.2, while fully open baselines comprise SmolLM2-1.7B, OLMo2-1B, YuLan-Mini-2.4B, and SmolLM3-3B. Benchmarks span domains: mathematics (GSM8K, MATH), code (sanitized-MBPP, HumanEval), Chinese (C-Eval, CMMLU), and reasoning/knowledge (MMLU, ARC, BoolQ, CSQA, HellaSwag, PIQA, SocialIQa, WinoGrande).

Key results:

  • Math+Code+Chinese average: 46.05, exceeding SmolLM2-1.7B (30.64), approaching SmolLM3-3B (52.64).
  • Reasoning average: 67.74, surpassing fully open peers SmolLM2-1.7B (66.05) and OLMo2-1B (62.06), approximating YuLan-Mini-2.4B (67.50).
  • Frontier proximity: Approaches open-weight competitors, e.g., Gemma2-2B (69.16).

This places PCMind-2.1-Kaiyuan-2B at the forefront of fully open models at the 2B parameter scale (Luo et al., 8 Dec 2025).

7. Open-Source Assets and Reproducibility

All assets associated with PCMind-2.1-Kaiyuan-2B—including PyTorch and Transformers-compatible model weights, the Kai preprocessing pipeline, complete data acquisition scripts and patches, phase-wise training recipes, and deduplicated dataset shards—are released under the Apache 2.0 license at https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B. Reproduction consists of installing Spark and Chukonu, acquiring raw sources per provided HuggingFace IDs, and executing:

kai process --config config/kaiyuan-2b.yaml
torchrun train.py --cfg config/train.yaml

This fully open and reproducible release enables comprehensive evaluation and modification by the broader research community (Luo et al., 8 Dec 2025).
