An Overview on Language Models: Recent Developments and Outlook (2303.05759v2)

Published 10 Mar 2023 in cs.CL

Abstract: Language modeling studies the probability distributions over strings of text. It is one of the most fundamental tasks in NLP. It has been widely used in text generation, speech recognition, machine translation, etc. Conventional language models (CLMs) aim to predict the probability of linguistic sequences in a causal manner, while pre-trained language models (PLMs) cover broader concepts and can be used in both causal sequential modeling and fine-tuning for downstream applications. PLMs have their own training paradigms (usually self-supervised) and serve as foundation models in modern NLP systems. This overview paper provides an introduction to both CLMs and PLMs from five aspects, i.e., linguistic units, architectures, training methods, evaluation methods, and applications. Furthermore, we discuss the relationship between CLMs and PLMs and shed light on the future directions of language modeling in the pre-trained era.

An Overview of Language Models: Recent Developments and Outlook

The paper, authored by Chengwei Wei, Yun-Cheng Wang, Bin Wang, and C.-C. Jay Kuo, provides a thorough review of language models (LMs), tracing their evolution, current state, and future prospects. It contrasts conventional language models (CLMs) with pre-trained language models (PLMs) and examines both along five dimensions: linguistic units, architectures, training methods, evaluation methods, and applications.

Fundamentals of Language Models

Language models are designed to study the probability distributions over sequences of linguistic units, such as words or characters. Historically, CLMs relied on statistical approaches built from small corpora or on data-driven approaches leveraging larger datasets. These models predict the next linguistic unit in a sequence given its preceding context, functioning in a causal, auto-regressive manner.
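
Concretely, an auto-regressive LM factorizes the joint probability of a token sequence with the chain rule, predicting each unit from all preceding ones (the standard formulation, stated here for clarity):

\[
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
\]

Training maximizes this likelihood over a corpus, and the same factorization is what enables left-to-right text generation.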

PLMs, on the other hand, extend beyond simple causality. They employ self-supervised learning paradigms and serve as foundational models in modern NLP systems. The strength of PLMs lies in their ability to generalize across diverse downstream tasks, achieved through extensive pre-training on broad linguistic data followed by fine-tuning.

Types of Language Models

The paper categorizes LMs into several types:

  1. Conventional language models (CLMs): Primarily causal and auto-regressive, predicting the probability of the next unit based on prior context.
  2. Structural LMs: Utilize predefined linguistic structures like dependency or parse trees to bring semantically relevant context closer to the unit being predicted.
  3. Bidirectional LMs: Utilize context from both directions; masked language models (MLMs), for example, predict masked tokens using both preceding and succeeding context (contrasted with causal prediction in the sketch after this list).
  4. Permutation LMs: Combine the strengths of CLMs and MLMs by randomizing input sequences, generating various permutations for token prediction.
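
The causal/masked contrast is easy to see in code. The sketch below uses the Hugging Face transformers library, with GPT-2 and BERT as example checkpoints chosen by us for illustration (not a recipe from the paper): a causal model predicts the next token from left context only, while a masked model fills a blanked position from both sides.

```python
# Minimal sketch of causal vs. masked prediction using the Hugging Face
# `transformers` library; GPT-2 and BERT are only convenient example checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoModelForMaskedLM, AutoTokenizer

# Causal LM: predict the next token from the left context only.
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
ids = clm_tok("Language modeling studies the probability of", return_tensors="pt")
with torch.no_grad():
    next_id = clm(**ids).logits[0, -1].argmax().item()
print("causal prediction:", clm_tok.decode(next_id))

# Masked LM: predict a masked token from both left and right context.
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
ids = mlm_tok(f"Language modeling studies the {mlm_tok.mask_token} of text.", return_tensors="pt")
mask_pos = (ids.input_ids == mlm_tok.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    masked_id = mlm(**ids).logits[0, mask_pos].argmax().item()
print("masked prediction:", mlm_tok.decode(masked_id))
```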

Linguistic Units and Tokenization

Tokenization methods are crucial for decomposing text sequences into manageable linguistic units:

  • Characters: Simplify vocabulary but require longer contexts for accurate predictions.
  • Words and Subwords: Commonly used but face challenges such as Out-Of-Vocabulary (OOV) issues. Subword tokenizers like Byte Pair Encoding (BPE) and WordPiece were developed to address these challenges (a toy BPE merge-learning example follows this list).
  • Phrases and Sentences: Used in specific applications such as speech recognition and text summarization to maintain semantic coherence.
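
To make the subword idea concrete, the toy function below learns BPE merge rules by repeatedly merging the most frequent adjacent symbol pair, in the spirit of the standard merge-learning loop; the word frequencies and the `</w>` end-of-word marker are illustrative assumptions, not data from the paper.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy illustration)."""
    # Start from characters, with an end-of-word marker so merges respect word boundaries.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Hypothetical word frequencies, chosen only so a merge such as ("e", "s") -> "es" appears.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(toy_corpus, num_merges=5))
```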

Architectures of Language Models

The architectures of LMs have evolved significantly:

  • N-gram Models: Simplified models that predict the next token based on the preceding N-1 tokens under the Markov assumption (a toy bigram example follows this list).
  • Maximum Entropy Models: Utilize feature functions for token prediction but can be computationally intensive.
  • Neural Network Models: Include Feed-forward Neural Networks (FNNs) and Recurrent Neural Networks (RNNs), both of which leverage continuous embedding spaces for better context management.
  • Transformers: The current state-of-the-art models, which use attention mechanisms to capture long-term dependencies. Variants include encoder-only, decoder-only, and encoder-decoder designs, chosen according to task requirements.
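
As a concrete instance of the simplest case, the sketch below trains a maximum-likelihood bigram model (N = 2) purely by counting; the three-sentence corpus is made up for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # contexts: everything that can precede a token
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return prob

# Made-up three-sentence corpus, used only to show the counting.
p = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(p("the", "cat"))   # 2/3: "the" is followed by "cat" in two of three sentences
print(p("cat", "sat"))   # 1/2
```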

Training Methods for PLMs

PLMs are trained via large-scale self-supervised learning:

  • Pre-training: Often involves masked language modeling or next-sentence prediction to learn generalizable language representations (a toy masking sketch follows this list).
  • Fine-Tuning: Adapts pre-trained models to specific downstream tasks using task-specific datasets. Techniques like adapter tuning and prompt tuning have emerged for more efficient fine-tuning.
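
A toy sketch of data preparation for the masked-language-modeling objective is shown below; it randomly masks tokens and keeps the originals as labels. Real recipes (BERT's 80/10/10 mask/random/keep split, for instance) are more involved, so treat this as a simplification.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Toy masked-LM data preparation: replace a random fraction of tokens with a mask
    symbol and keep the originals as prediction targets. Positions labeled None are
    ignored by the loss."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

print(mask_tokens("pre-trained language models serve as foundation models".split()))
```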

Evaluation Methods

Evaluations are categorized into intrinsic and extrinsic methods:

  • Intrinsic Evaluation: Metrics like perplexity and pseudo-log-likelihood (PLL) scores measure how well an LM predicts natural text sequences (perplexity is computed in the short example after this list).
  • Extrinsic Evaluation: Performance on downstream tasks like the GLUE and SuperGLUE benchmarks provides insights into the practical utility of LMs.
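
Perplexity is the exponentiated average negative log-likelihood per token. The snippet below computes it from a list of per-token log-probabilities; the uniform 0.25 probabilities are only a sanity check, not a result from the paper.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-(1/N) * sum_t log p(w_t | context))."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model assigning probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))   # ≈ 4.0
```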

Applications in Text Generation

The application of LMs in text generation spans various tasks including dialogue systems, automatic speech recognition (ASR), and machine translation. Efficient decoding methods like beam search and sampling-based techniques play vital roles in improving the quality of generated text.
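
As an illustration of one such decoding strategy, the sketch below implements a generic beam search over a caller-supplied next-token distribution; `step_fn` and its probabilities are hypothetical stand-ins for a real model's output.

```python
import math

def beam_search(step_fn, start, beam_size=2, max_len=4):
    """Generic beam search: step_fn(seq) returns candidate (token, probability) pairs;
    the beam keeps the beam_size partial sequences with the highest cumulative log-probability."""
    beams = [(list(start), 0.0)]               # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in step_fn(seq):
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams

# Hypothetical next-token distribution standing in for a real LM's output.
def step_fn(seq):
    return [("a", 0.6), ("b", 0.3), ("c", 0.1)]

for seq, score in beam_search(step_fn, start=["<s>"]):
    print(seq, round(score, 3))
```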

Improving Efficiency

Given the increasing complexity and size of modern LMs, the paper highlights the importance of efficient model training and usage. Techniques such as knowledge distillation, pruning, and fast decoding methods are discussed to reduce model size and inference latency without compromising performance.
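
As one small example of these techniques, the function below computes a soft-target knowledge-distillation loss (temperature-softened KL divergence between teacher and student outputs, following the common formulation attributed to Hinton et al.); it is a generic PyTorch sketch, not the specific recipe of any distilled model discussed in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation: KL divergence between the teacher's and the student's
    temperature-softened output distributions, scaled by T^2 so gradient magnitudes
    stay comparable across temperatures."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: random logits for a batch of 4 examples over a 10-symbol vocabulary.
student = torch.randn(4, 10)
teacher = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())
```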

Future Directions

The paper outlines several promising research directions:

  • Integration of LMs and Knowledge Graphs (KGs): Combining the structured knowledge of KGs with the contextual understanding of LMs can enhance reasoning capabilities.
  • Incremental Learning: Developing methods to update LMs with new information without retraining from scratch.
  • Lightweight Models: Creating cost-effective and environmentally friendly models.
  • Domain-Specific Models: Exploring the benefits of specialized models over universal LMs for specific domains.
  • Interpretable Models: Enhancing the transparency and explainability of LMs to avoid issues like hallucination in text generation.
  • Detection of Machine-Generated Text: Developing reliable methods to differentiate between human-written and machine-generated content.

In conclusion, the paper comprehensively covers the landscape of language models, providing valuable insights into their development, applications, and future prospects in NLP research.
