
LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization (2108.00801v2)

Published 2 Aug 2021 in cs.CL and cs.AI

Abstract: Language model pre-training based on large corpora has achieved tremendous success in constructing enriched contextual representations and has led to significant performance gains on a diverse range of Natural Language Understanding (NLU) tasks. Despite this success, most current pre-trained language models, such as BERT, are trained with single-grained tokenization, usually fine-grained characters or sub-words, making it hard for them to learn the precise meaning of coarse-grained words and phrases. In this paper, we propose a simple yet effective pre-training method named LICHEE to efficiently incorporate multi-grained information of input text. Our method can be applied to various pre-trained language models and improve their representation capability. Extensive experiments conducted on CLUE and SuperGLUE demonstrate that our method achieves comprehensive improvements on a wide variety of NLU tasks in both Chinese and English with little extra inference cost incurred, and that our best ensemble model achieves state-of-the-art performance on the CLUE benchmark competition.
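The abstract describes fusing fine-grained (sub-word/character) and coarse-grained (word/phrase) information at the embedding level without changing the encoder. Below is a minimal, hypothetical PyTorch sketch of one way such a fusion could look; the element-wise max pooling and the per-position alignment of coarse tokens are illustrative assumptions, not necessarily the paper's exact formulation.

```python
# Hypothetical sketch: fuse fine-grained (sub-word) and coarse-grained (word/phrase)
# embeddings before a standard transformer encoder. The max-pooling fusion and the
# aligned-index scheme are assumptions for illustration.
import torch
import torch.nn as nn

class MultiGrainedEmbedding(nn.Module):
    def __init__(self, fine_vocab_size, coarse_vocab_size, hidden_size):
        super().__init__()
        self.fine_embed = nn.Embedding(fine_vocab_size, hidden_size)
        self.coarse_embed = nn.Embedding(coarse_vocab_size, hidden_size)

    def forward(self, fine_ids, coarse_ids):
        # fine_ids:   (batch, seq_len) sub-word / character token ids
        # coarse_ids: (batch, seq_len) id of the word or phrase covering each position
        fine = self.fine_embed(fine_ids)
        coarse = self.coarse_embed(coarse_ids)
        # Element-wise max pooling keeps the sequence length unchanged, so the
        # downstream encoder and its inference cost stay essentially the same.
        return torch.max(fine, coarse)

# Usage: the fused embeddings would replace the token embeddings of an existing
# encoder such as BERT, leaving the rest of the architecture untouched.
embedder = MultiGrainedEmbedding(fine_vocab_size=30522, coarse_vocab_size=100000, hidden_size=768)
fine_ids = torch.randint(0, 30522, (2, 16))
coarse_ids = torch.randint(0, 100000, (2, 16))
fused = embedder(fine_ids, coarse_ids)   # shape: (2, 16, 768)
```

Because the fusion happens before the encoder and preserves sequence length, this design is consistent with the abstract's claim of "little extra inference cost incurred."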

Authors (8)
  1. Weidong Guo (25 papers)
  2. Mingjun Zhao (13 papers)
  3. Lusheng Zhang (2 papers)
  4. Di Niu (67 papers)
  5. Jinwen Luo (4 papers)
  6. Zhenhua Liu (47 papers)
  7. Zhenyang Li (28 papers)
  8. Jianbo Tang (7 papers)
Citations (8)