FPM: A Collection of Large-scale Foundation Pre-trained Language Models (2111.04909v3)

Published 9 Nov 2021 in cs.CL and cs.AI

Abstract: Large-scale Transformer models have significantly promoted the recent development of natural language processing applications. However, little effort has been made to unify the effective models. In this paper, motivated by the goal of providing a new set of baseline models for future work, we adopt various novel transformer architectures and launch a model set with the help of recent mainstream technologies. We focus our discussion on optimizing the depth of the networks based on the existing powerful encoder-decoder structures. We show that by properly avoiding training defects such as non-convergence and degradation, scaling up off-the-shelf transformer architectures consistently delivers better performance. To stimulate future research on large-scale language model pretraining, we present extensive results and detailed discussions on network performance improvements with respect to network depth and confirm the existence of an optimal number of layers under specific tasks. To the best of our knowledge, we provide the largest Chinese generative model and the largest Chinese encoding model. The BERT language models we trained on English datasets deliver a 14.45% higher F1 score than Turing-NLR.
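
The abstract's central knob is the depth of an encoder-decoder Transformer stack. The paper's own code is not shown here; the snippet below is a minimal, hypothetical sketch, assuming PyTorch's standard nn.Transformer, of how a depth sweep of that kind could be set up. The function name build_encoder_decoder and the candidate depths are illustrative choices, not values taken from the paper.

import torch
import torch.nn as nn

def build_encoder_decoder(depth: int, d_model: int = 512, nhead: int = 8) -> nn.Transformer:
    # Hypothetical illustration, not the paper's code: `depth` sets both the
    # encoder and decoder stack sizes, the hyperparameter the abstract studies.
    return nn.Transformer(
        d_model=d_model,
        nhead=nhead,
        num_encoder_layers=depth,
        num_decoder_layers=depth,
        batch_first=True,
    )

# Sweep candidate depths; in the paper's setting, task-specific validation at
# this point is what would reveal the optimal number of layers.
for depth in (2, 6, 12):
    model = build_encoder_decoder(depth)
    src = torch.randn(2, 16, 512)   # (batch, source length, d_model)
    tgt = torch.randn(2, 16, 512)   # (batch, target length, d_model)
    out = model(src, tgt)
    print(depth, out.shape)         # torch.Size([2, 16, 512]) for each depth

In practice the loop body would train and evaluate each configuration rather than run a single forward pass; the sketch only shows where depth enters the model definition.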
