PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation (2312.17276v1)
Abstract: The recent trend in LLMs is to increase both model size (a.k.a. the number of parameters) and dataset scale to achieve better generative ability, as demonstrated by well-known works such as GPT and LLaMA. However, large models incur massive computational costs that practical applications often cannot afford, and how to construct a strong model architecture for LLMs is rarely discussed. We first analyze state-of-the-art LLM architectures and observe the feature collapse problem. Based on a theoretical analysis, we argue that nonlinearity, which has mainly been studied in convolutional neural networks for vision tasks, is also very important for LLMs. We then introduce a series-informed activation function with negligible extra computation and further employ an augmented shortcut to enhance the model's nonlinearity. Carefully designed ablations demonstrate that the proposed approach significantly enhances model nonlinearity; we thus present a new efficient architecture for modern LLMs, namely PanGu-$\pi$. Experiments using the same dataset and training strategy are then conducted to compare PanGu-$\pi$ with state-of-the-art LLMs. The results show that PanGu-$\pi$-7B achieves performance comparable to the benchmark models with about a 10\% inference speed-up, and PanGu-$\pi$-1B achieves state-of-the-art performance in terms of both accuracy and efficiency. In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law to develop an LLM named YunShan for practical applications. The results show that YunShan surpasses other models of similar scale on benchmarks.
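To make the two architectural ingredients concrete, below is a minimal PyTorch-style sketch of what a series-informed activation and an augmented shortcut could look like. This is an illustrative sketch under assumptions, not the paper's exact implementation: the choice of GELU as the base activation, the single linear augmented path, and names such as `SeriesActivation`, `AugmentedShortcutBlock`, `scales`, and `shifts` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SeriesActivation(nn.Module):
    """Sketch of a series-informed activation: a learnable weighted sum of
    shifted base activations, adding nonlinearity at negligible cost.
    (Illustrative; the paper's exact parameterization may differ.)"""

    def __init__(self, n: int = 3):
        super().__init__()
        self.scales = nn.Parameter(torch.ones(n))                   # a_i (hypothetical)
        self.shifts = nn.Parameter(torch.linspace(-1.0, 1.0, n))    # b_i (hypothetical)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # sigma_s(x) = sum_i a_i * GELU(x + b_i)
        return sum(a * F.gelu(x + b) for a, b in zip(self.scales, self.shifts))


class AugmentedShortcutBlock(nn.Module):
    """Sketch of an augmented shortcut: alongside the identity skip connection,
    a cheap learnable parallel path bypasses the attention module to counteract
    feature collapse. (Illustrative assumption: a single bias-free linear path.)"""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aug = nn.Linear(dim, dim, bias=False)   # augmented parallel path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        # identity shortcut + attention branch + augmented shortcut
        return x + attn_out + self.aug(h)


# Tiny usage example with hypothetical dimensions.
x = torch.randn(2, 16, 512)                  # (batch, sequence, hidden dim)
block = AugmentedShortcutBlock(dim=512)
act = SeriesActivation()
print(act(block(x)).shape)                   # torch.Size([2, 16, 512])
```

In this sketch the series activation could replace the pointwise nonlinearity in the FFN, while the augmented shortcut sits in parallel with the attention branch; both add only a small number of parameters and elementwise operations on top of a standard Transformer block.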