
Chinese ModernBERT: Efficient Transformer

Updated 21 October 2025
  • Chinese ModernBERT is a Chinese encoder-only Transformer that integrates hardware-aware vocabulary design, dynamic whole-word masking, and extended context pre-training to address unique Chinese language challenges.
  • It employs a 32,000-token BPE vocabulary and a two-stage pre-training pipeline with Rotary Position Embeddings to optimize both short and long sequence processing.
  • The model achieves competitive accuracy on CLUE benchmarks and improved semantic similarity via contrastive fine-tuning, making it viable for large-scale Chinese NLP applications.

Chinese ModernBERT is a from-scratch Chinese encoder-only Transformer designed to integrate advances in architecture, pre-training protocols, and systems optimization specifically for the Chinese language. The model adapts various innovations—hardware-aware vocabulary design, dynamic masking curriculum, efficient long-context pre-training, and advanced learning-rate schedules—to address the unique challenges of Chinese morphology and tokenization. It achieves competitive accuracy and state-of-the-art efficiency on standard Chinese natural language understanding tasks, with further scaling potential for retrieval and semantic similarity via contrastive fine-tuning (Zhao et al., 14 Oct 2025).

1. Vocabulary Optimization and Tokenization

Chinese ModernBERT employs a 32,000-token Byte-Pair Encoding (BPE) vocabulary, constructed to maximize coverage of frequent Chinese affixes and compounds. This vocabulary size is a multiple of 64, which is beneficial for kernel tiling and hardware-level efficiency in modern GPU and accelerator systems. The resulting input compression increases the average number of characters per token compared to previous vocabularies (e.g., 21k tokens), effectively reducing the input sequence length for the same number of characters.
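
As a rough illustration of this compression effect, the sketch below measures average characters per token with the HuggingFace `tokenizers` API; the tokenizer file path and sample sentences are hypothetical placeholders, and the paper's exact measurement protocol is not specified here.

```python
# Minimal sketch: average characters per token, as a proxy for input compression.
from tokenizers import Tokenizer

def chars_per_token(tokenizer: Tokenizer, texts: list[str]) -> float:
    """Higher values mean shorter input sequences for the same amount of text."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenizer.encode(t).tokens) for t in texts)
    return total_chars / total_tokens

tok = Tokenizer.from_file("chinese_modernbert_tokenizer.json")  # hypothetical path
sample = ["现代汉语文本的分词与压缩效果示例。", "长文档场景下的字符与词元比值。"]
print(f"chars/token: {chars_per_token(tok, sample):.2f}")
```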

This compaction substantially reduces the portion of model parameters allocated to the embedding table, freeing parameter budget for deeper Transformer computation, which empirical evidence in large-scale encoder architectures suggests is a more effective place to allocate additional capacity. The vocabulary is constructed from large, representative corpora (CCI3-HQ, CCI4 [Chinese], Cosmopedia-Chinese), closely matching real-world lexical and morphological distributions.
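
A minimal sketch of constructing such a vocabulary with the HuggingFace `tokenizers` library follows; the trainer settings, special tokens, and corpus shard paths are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: training a 32,000-entry BPE vocabulary (size kept a multiple of 64 for kernel tiling).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # multiple of 64
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Hypothetical shard paths standing in for CCI3-HQ / CCI4 (Chinese) / Cosmopedia-Chinese text.
corpus_files = ["cci3_hq.txt", "cci4_zh.txt", "cosmopedia_zh.txt"]
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("chinese_modernbert_tokenizer.json")
```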

2. Whole-Word Masking with Dynamic Curriculum

Departing from static masking schemes, Chinese ModernBERT implements whole-word masking (WWM) that strictly preserves the atomicity of Chinese word units. WWM is guided by segmentation tools that recognize multi-character compounds and affixes unique to Chinese.
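
The sketch below illustrates the idea using `jieba` as a stand-in segmenter (the paper's segmentation tool is not named in this summary): masking decisions are made per segmented word, and every character of a selected word is masked together.

```python
# Sketch: whole-word masking at the word level, with jieba as an example Chinese segmenter.
import random
import jieba

def whole_word_mask(text: str, mask_ratio: float, mask_token: str = "[MASK]") -> list[str]:
    """Return a token-like sequence in which each selected *word* is fully masked."""
    out = []
    for word in jieba.cut(text):
        if random.random() < mask_ratio:
            out.extend([mask_token] * len(word))  # mask every character of the chosen word
        else:
            out.append(word)
    return out

print(whole_word_mask("自然语言处理模型的整词掩码示例", mask_ratio=0.3))
```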

To better align pre-training task complexity with the evolving learning capability of the model, a dynamic masking curriculum is introduced. Training begins with a high masking rate (up to 30%), promoting global context modeling and forcing the model to reconstruct large swaths of information. As training progresses, the masking ratio is linearly decayed to 15%, which encourages the model to perform finer-grained, local context discrimination. This anti-curriculum design results in a phased focus—broad global alignment early, and detailed local correction late—enhancing both generalization and downstream fine-tuning stability.
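
A minimal sketch of the linear decay is shown below; it assumes the ratio is annealed over the full pre-training horizon, since the exact decay window is not stated in this summary.

```python
def masking_ratio(step: int, total_steps: int,
                  start: float = 0.30, end: float = 0.15) -> float:
    """Linearly anneal the whole-word masking ratio from `start` down to `end`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

# Example: at 0%, 50%, and 100% of training the ratio is 0.30, 0.225, and 0.15.
print([round(masking_ratio(s, 100), 3) for s in (0, 50, 100)])
```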

3. Two-Stage Pre-training with Extended Context and Attention Design

Chinese ModernBERT's pre-training pipeline comprises two distinct stages tailored for context scaling and stability:

  • Stage I: Pre-training is conducted on sequences of up to 1,024 tokens with a large batch size to maximize throughput and initial convergence speed.
  • Stage II: After stability is achieved, the native context window is extended to 8,192 tokens, with the batch size reduced and learning rates scaled so that tokens-per-update stays constant (see the sketch below). This enables explicit modeling of long-range dependencies and allows the model to contextualize deep document-level structure, which is critical for Chinese legal, encyclopedic, and social media applications.
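
The following sketch keeps tokens-per-update constant across the context extension; the Stage I batch size is a hypothetical placeholder, since this summary does not give the actual values.

```python
# Sketch: constant tokens-per-update when moving from 1,024- to 8,192-token sequences.
STAGE1_SEQ_LEN, STAGE1_BATCH = 1024, 4096        # batch size is a hypothetical placeholder
TOKENS_PER_UPDATE = STAGE1_SEQ_LEN * STAGE1_BATCH

STAGE2_SEQ_LEN = 8192
STAGE2_BATCH = TOKENS_PER_UPDATE // STAGE2_SEQ_LEN  # 8x smaller batch keeps tokens/update fixed

assert STAGE2_SEQ_LEN * STAGE2_BATCH == TOKENS_PER_UPDATE
print(f"Stage II batch size: {STAGE2_BATCH}")       # 512 with the placeholder numbers above
```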

Rotary Position Embeddings (RoPE) are employed with a dual-base approach: local attention layers use θ = 10,000, while global attention layers employ θ = 80,000. Because local and global attention alternate across layers, only a subset of layers computes full quadratic self-attention, which lets the model process long inputs at reduced cost while remaining efficient on short sequences.
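
The sketch below computes the dual-base rotary angles and lays out an alternating local/global layer pattern; only the two θ values come from the description above, while the one-global-in-three alternation ratio, layer count, and head dimension are illustrative assumptions.

```python
import torch

def rope_angles(head_dim: int, max_pos: int, base: float) -> torch.Tensor:
    """Rotary position angles for one attention layer with the given RoPE base θ."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(max_pos, dtype=torch.float32)
    return torch.outer(positions, inv_freq)          # shape: (max_pos, head_dim // 2)

# Alternating attention layout: global layers use θ = 80,000, local layers θ = 10,000.
# The one-global-per-three-layers pattern and 22 layers are illustrative assumptions.
layers = [
    {"attention": "global", "rope_base": 80_000.0} if i % 3 == 0
    else {"attention": "local", "rope_base": 10_000.0}
    for i in range(22)
]
angles_per_layer = [rope_angles(64, 8192, cfg["rope_base"]) for cfg in layers]
```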

4. Optimization Techniques: Damped-Cosine Learning Rate Schedule

The pre-training procedure uses a novel damped-cosine learning rate schedule to ensure stable convergence over very long training runs. The update is governed by:

$$\eta(s) = \tfrac{1}{2}\left[\mathrm{Peak}(p) + \mathrm{Valley}(p)\right] + \tfrac{1}{2}\left[\mathrm{Peak}(p) - \mathrm{Valley}(p)\right]\cos\!\left(\pi(2N-1)\,p\right)$$

where $p = s/S$ is the relative training progress, $N$ is the number of cycles, $\mathrm{Peak}(p)$ decays from $\eta_{\max}$ by a damping factor $\gamma$, and $\mathrm{Valley}(p)$ traces a similar path toward $\eta_{\min}$. This schedule smooths out learning-rate oscillations and reduces the risk of instability or divergence that can arise from abrupt changes in sequence length or batch size across pre-training phases.
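
A minimal sketch of this schedule is given below; the exponential form of the Peak and Valley envelopes (damping factor $\gamma$) is an assumption made for illustration, since only their endpoints and damping behavior are described above.

```python
import math

def damped_cosine_lr(step: int, total_steps: int, *, eta_max: float, eta_min: float,
                     num_cycles: int, gamma: float) -> float:
    """Damped-cosine schedule: a decaying cosine oscillating between Peak(p) and Valley(p).

    The exponential damping of both envelopes (factor `gamma`) is an assumed parameterization.
    """
    p = min(max(step / total_steps, 0.0), 1.0)   # relative progress s / S
    peak = eta_max * gamma ** p                  # decays from eta_max
    valley = eta_min * gamma ** p                # traces a similar path near eta_min
    mid, amp = 0.5 * (peak + valley), 0.5 * (peak - valley)
    return mid + amp * math.cos(math.pi * (2 * num_cycles - 1) * p)

# Illustrative values: 4 cycles over 100k steps, warming from 6e-4 toward ~1e-5.
lr_start = damped_cosine_lr(0, 100_000, eta_max=6e-4, eta_min=1e-5, num_cycles=4, gamma=0.5)
lr_end = damped_cosine_lr(100_000, 100_000, eta_max=6e-4, eta_min=1e-5, num_cycles=4, gamma=0.5)
```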

The adjustment of the learning rate based on training stage and effective batch size is critical for both constant throughput (measured in tokens/second) and stability, particularly as context size increases from 1k to 8k tokens.

5. Performance on CLUE and Efficient Throughput

Chinese ModernBERT achieves robust downstream accuracy across the CLUE benchmark suite, including tasks such as AFQMC, TNEWS, and IFLYTEK, under a unified fine-tuning protocol. Comparative evaluation shows that the model is competitive with strong Chinese encoder baselines such as RoBERTa-wwm-ext and ALBERT-xxlarge.

Resource efficiency is a key differentiator. Under bfloat16 (bf16) precision, Chinese ModernBERT processes approximately 180,100 tokens/second for 8k-token sequences, with strong efficiency at 512-token sequence lengths as well. This is attributable to the combined effects of vocabulary compression, hybrid attention, and FlashAttention kernel usage.

This throughput optimization positions Chinese ModernBERT for large-scale production deployments and high-throughput inference or batch processing, particularly for applications involving lengthy Chinese documents.

6. Retrieval and Semantic Similarity: Contrastive Fine-Tuning

Retrieval-oriented capabilities are directly addressed by contrastive fine-tuning on open datasets. Fine-tuning on SimCLUE (~3 million pairs) yields significant improvements in semantic textual similarity (STS): Pearson correlation reaches approximately 0.488 on the test set and rises to about 0.505 when 2 million additional pairs from T2Ranking are included, with Spearman correlation improving similarly. Under this setup, Chinese ModernBERT outperforms established baseline models such as Qwen-0.6B-embedding on SimCLUE.
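
A minimal sketch of a contrastive objective of this kind is shown below, using in-batch negatives over sentence embeddings; the temperature, normalization, and loss variant are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss with in-batch negatives over L2-normalized embeddings.

    Shapes: (batch, hidden) for both inputs; row i of `positive_emb` is the positive for row i.
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                     # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: the embeddings would come from pooling Chinese ModernBERT's final hidden states
# over SimCLUE / T2Ranking text pairs (data handling omitted here).
```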

A plausible implication is that even modest amounts of high-quality, curated contrastive pairs can yield meaningful improvements in STS and retrieval. The model’s architecture supports continued scaling of this capability, as further increases in contrastive fine-tuning data are expected to drive improvement in embedding quality for similarity search, question retrieval, and related retrieval-based tasks.

7. Scaling Path and Future Directions

The observed improvements on CLUE and retrieval tasks confirm the efficacy of hardware-aware vocabulary design, dynamic WWM, and long-context handling for Chinese Transformer encoders. The results further indicate a clear and practical scaling path for semantic textual similarity: incremental integration of additional labeled contrastive pairs, either from curated or semi-automatic sources, is an effective route to lifting retrieval metrics.

Chinese ModernBERT’s tokenizer and model weights are to be released, supporting reproducibility, further research in large-scale Chinese encoders, and rapid adaptation for downstream applications requiring both high accuracy and throughput in Chinese-specific settings (Zhao et al., 14 Oct 2025).
