GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts (2307.05354v1)

Published 11 Jul 2023 in cs.CL

Abstract: In the context of the rapid development of LLMs, we have meticulously trained and introduced the GujiBERT and GujiGPT LLMs, which are foundational models specifically designed for intelligent information processing of ancient texts. These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters, allowing them to effectively handle various natural language processing tasks related to ancient books, including but not limited to automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation. Notably, these models have exhibited exceptional performance across a range of validation tasks using publicly available datasets. Our research findings highlight the efficacy of employing self-supervised methods to further train the models using classical text corpora, thus enhancing their capability to tackle downstream tasks. Moreover, it is worth emphasizing that the choice of font, the scale of the corpus, and the initial model selection all exert significant influence over the ultimate experimental outcomes. To cater to the diverse text processing preferences of researchers in digital humanities and linguistics, we have developed three distinct categories comprising a total of nine model variations. We believe that by sharing these foundational LLMs specialized in the domain of ancient texts, we can facilitate the intelligent processing and scholarly exploration of ancient literary works and, consequently, contribute to the global dissemination of China's rich and esteemed traditional culture in this new era.

Authors (11)
  1. Dongbo Wang (8 papers)
  2. Chang Liu (863 papers)
  3. Zhixiao Zhao (4 papers)
  4. Si Shen (11 papers)
  5. Liu Liu (190 papers)
  6. Bin Li (514 papers)
  7. Haotian Hu (14 papers)
  8. Mengcheng Wu (1 paper)
  9. Litao Lin (2 papers)
  10. Xue Zhao (7 papers)
  11. Xiyu Wang (24 papers)
Citations (6)